Re: Merge tool based on mergefactor

2013-06-19 Thread Cosimo Streppone
On 06/19/2013 03:21 AM, Otis Gospodnetic wrote:

 You could call the optimize command directly on slaves, but specify
 the target number of segments, e.g.
 /solr/update?optimize=true&maxSegments=10
 
 Not sure I recommend doing this on slaves, but you could - maybe you
 have spare capacity.  You may also want to consider not doing it on
 all your slaves at the same time...

IIUC this assumes your slaves do not replicate too often,
otherwise replication would reset the index to whatever
number of segments the master has.

You could still perform an optimize with maxSegments after
every replication, if it's acceptable in the situation
you are in.

However, if you need slaves to update every 2-5 minutes,
that would be impractical and wasteful.

Is this correct?

If so, how to find a fair compromise/balance between
master and slave merge factors if you need very frequent indexing
of new documents (say continuous) on the master and up-to-date
indexes on the slaves (say 2-5' pollInterval)?
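
(For reference, a rough sketch of what a bounded optimize against one slave could look like from SolrJ -- the URL and segment count below are illustrative assumptions, not a recommendation:)

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SlaveOptimize {
    public static void main(String[] args) throws Exception {
        // Illustrative slave core URL; this would run after a replication cycle completes.
        HttpSolrServer slave = new HttpSolrServer("http://slave_host:8983/solr/core1");
        // optimize(waitFlush, waitSearcher, maxSegments): merge down to at most 10
        // segments instead of a full single-segment optimize.
        slave.optimize(true, true, 10);
        slave.shutdown();
    }
}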

-- 
Cosimo


Adding documents in Solr plugin

2013-06-19 Thread Avner Levy
I have a core with millions of records.
I want to add a custom handler which scans the existing documents and updates one 
of the fields (delete and re-add the document) based on a condition (age > 12 for 
example).
All fields are stored, so there is no problem recreating the document from the 
search result.
I prefer doing it on the Solr server side to avoid sending millions of 
documents to the client and back.
I'm thinking of writing a Solr plugin which will receive a query and update 
some fields on the matching documents (like the delete-by-query handler).
Are there existing solutions or better alternatives?
I couldn't find any examples of Solr plugins which update / add / delete 
documents (I don't need to extend the update handler).
If someone has an example it would be a great help.
Thanks in advance
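
(A rough, untested sketch of the kind of handler described above, written against the Solr 4.x plugin APIs -- the "age_group" field is purely illustrative, and it ignores paging, multi-valued and binary fields, commits and error handling:)

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.search.Query;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.QParser;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorChain;

public class UpdateByQueryHandler extends RequestHandlerBase {

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        // Parse the incoming q parameter with the default query parser.
        String qstr = req.getParams().get(CommonParams.Q);
        Query query = QParser.getParser(qstr, null, req).getQuery();

        SolrIndexSearcher searcher = req.getSearcher();
        // Fetch the matching docs (only the first 1000 here; a real handler would page).
        DocList docs = searcher.getDocList(query, (Query) null, null, 0, 1000);

        // Send the rebuilt documents through the normal update chain.
        UpdateRequestProcessorChain chain = req.getCore().getUpdateProcessingChain(null);
        UpdateRequestProcessor proc = chain.createProcessor(req, rsp);
        try {
            DocIterator it = docs.iterator();
            while (it.hasNext()) {
                Document stored = searcher.doc(it.nextDoc());
                SolrInputDocument sdoc = new SolrInputDocument();
                // Copy all stored fields back into an input document.
                for (IndexableField f : stored.getFields()) {
                    Object val = f.numericValue() != null ? f.numericValue() : f.stringValue();
                    sdoc.addField(f.name(), val);
                }
                // Apply the conditional change, e.g. rewrite a flag field ("age_group" is illustrative).
                sdoc.setField("age_group", "adult");

                AddUpdateCommand cmd = new AddUpdateCommand(req);
                cmd.solrDoc = sdoc;
                proc.processAdd(cmd);
            }
        } finally {
            proc.finish();
        }
        rsp.add("updated", docs.size());
    }

    @Override
    public String getDescription() {
        return "Re-indexes the documents matching q with a modified field (sketch only)";
    }

    @Override
    public String getSource() {
        return null;
    }
}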


Re: Adding documents in Solr plugin

2013-06-19 Thread Upayavira
This could be a very useful feature. To do it properly, you'd want some
new update syntax, extending that of the atomic updates. That is, a new
custom request handler could do it, but might not be the best way.

If I were to try this, I'd look into the atomic update tickets in JIRA
and see what code they touched. See if you can find a way to add
something there. 

Upayavira

On Wed, Jun 19, 2013, at 08:52 AM, Avner Levy wrote:
 I have a core with millions of records.
 I want to add a custom handler which scan the existing documents and
 update one of the fields (delete and add document) based on a condition
 (age > 12 for example).
 All fields are stored so there is no problem to recreate the document
 from the search result.
 I prefer doing it on the Solr server side for avoiding sending millions
 of documents to the client and back.
 I'm thinking of writing a solr plugin which will receive a query and
 update some fields on the query documents (like the delete by query
 handler).
 Are existing solutions or better alternatives?
 I couldn't find any examples of Solr plugins which update / add / delete
 documents (I don't need to extend the update handler).
 If someone has an example it will be great help.
 Thanks in advance


Disable Replication for all Cores in a single Command

2013-06-19 Thread Ralf Heyde

Hello Folks,

is it possible to disable the replication for ALL cores using one 
command? We currently use Solr 3.6.


Currently we have a curl operation, which fires:
http://slave_host:port/solr/core/admin/replication/index.jsp?poll=disable

In the documentation there is a URL command which seems to be correct, 
but it returns a 404.

http://slave_host:port/solr/replication?command=disablepoll

Since we have many cores and many Servers, this takes a while.

Thanks, Ralf


UnInverted multi-valued field

2013-06-19 Thread Jochen Lienhard

Hi @all.

We have the problem that after an update the index takes too much time 
to 'warm up'.


We have some multivalued facet fields, and during startup Solr 
logs messages like:


INFO: UnInverted multi-valued field 
{field=mt_facet,memSize=18753256,tindexSize=54,time=170,phase1=156,nTerms=17,bigTerms=3,termInstances=903276,uses=0}



In the solrconfig we use the facet.method 'fc'.
We know that start-up with the 'enum' method is faster, but then 
the searches are very slow.


How do you handle this problem?
Or have you any idea for optimizing the warm up?
Or what do you do after an update?

Greetings

Jochen

--
Dr. rer. nat. Jochen Lienhard
Dezernat EDV

Albert-Ludwigs-Universität Freiburg
Universitätsbibliothek
Rempartstr. 10-16  | Postfach 1629
79098 Freiburg | 79016 Freiburg

Telefon: +49 761 203-3908
E-Mail: lienh...@ub.uni-freiburg.de
Internet: www.ub.uni-freiburg.de



Re: Solr string field stripping new lines line breaks

2013-06-19 Thread sodoo
Dears,

My English is bad, but I will try to explain.

I have indexed databases and files. The files include docx, pdf and txt.
I have indexed all of the data, but in the indexed documents the text of the
pdf files all runs together continuously.

I want the line breaks to appear: the line breaks in the document files should
also be line breaks in the indexed documents.

My frontend app is SOLARIUM.

How can I make the line breaks appear in the indexed data?
Please assist me on this.

Thank you



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-string-field-stripping-new-lines-line-breaks-tp3984384p4071595.html
Sent from the Solr - User mailing list archive at Nabble.com.


getting different search results for words with same meaning in Japanese language

2013-06-19 Thread Yash Sharma
Hi,

We have two Japanese words with the same meaning, ソフトウェア and ソフトウエア (notice
the difference in the capital-I-looking character; both words mean 'software'
in English). When ソフトウェア is searched, it gives around 8 search
results, but when ソフトウエア is searched, it gives only 2 search results.

The Japanese translator told us that this is something called yugari (meaning
the two words are like authorise and authorize: spelled differently but with
the same meaning, so they should yield the same search results).

We have one solution to this issue - use synonyms.txt and place all
these similar words in that text file. This solved our problem to some
extent, but in a real-world scenario we do not have a list of all the Japanese
technical words like software, product, technology, and so on, and we cannot
keep updating synonyms.txt on a daily basis.

Is there any better solution, so that all the similar Japanese words give the
same search results?
Any help is greatly appreciated.

-- 
Regards,

Yash Sharma
Sr. Software Engineer | y...@osscube.com | +91-9873200649

OSSCube Solutions Pvt. Ltd.
Noida A-42/6, Sector-62
Noida-201301 (UP)


Solr Suggest does not work in solrcloud environment

2013-06-19 Thread Sharp
Hi Guys

I am having difficulties running a suggest Search Handler in a solrcloud
environment. The configuration was tested on a standalone machine and works
fine there. 

Here is my configuration:

*Schema.xml*

<field name="suggest" type="suggest_text" indexed="true" stored="false"
       multiValued="true" />

<copyField source="field1" dest="suggest" />
<copyField source="field2" dest="suggest" />
<copyField source="field3" dest="suggest" />
...

<fieldType name="suggest_text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonym.txt"
            ignoreCase="true"
            expand="true" />
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopword.txt"
            enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protword.txt" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopword.txt"
            enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protword.txt" />
  </analyzer>
</fieldType>


*Solrconfig.xml*

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <str name="queryAnalyzerFieldType">suggest_text</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest</str>
    <float name="threshold">0</float>
    <str name="buildOnCommit">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">suggest</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.2</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">50</int>
    <int name="minQueryLength">2</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">suggest</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>


<requestHandler class="org.apache.solr.handler.component.SearchHandler"
                name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

As soon as I post a query on
http://url.com:8983/solr/mycore/suggest?q=bar&wt=json

I get an empty answer

{responseHeader:{status:0,QTime:0}}

No errors or warnings in the log. Any ideas?

Simon



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Suggest-does-not-work-in-solrcloud-environment-tp4071587.html
Sent from the Solr - User mailing list archive at Nabble.com.


how to reterieve all results from lucene searcher.search() method

2013-06-19 Thread neeraj shah
hello,

Is there any way to get all of the search results?
In Lucene we get the top documents by giving a limit like the top 100, 1000, etc.,
but what if I want to get all results?

How can I achieve that?

 Query qu = new QueryParser(Version.LUCENE_36, field, analyzer).parse(query);

 TopDocs hits = searcher.search(qu, 1000);
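
One common approach (a sketch against the Lucene 3.6 API) is to count the hits first with a TotalHitCountCollector and then run the search again asking for exactly that many documents -- note that collecting every hit of a huge index can use a lot of memory:

 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.search.TotalHitCountCollector;

 // 'searcher' and 'qu' as in the snippet above
 TotalHitCountCollector counter = new TotalHitCountCollector();
 searcher.search(qu, counter);                    // first pass: just count the matches
 int total = Math.max(1, counter.getTotalHits()); // search(query, n) requires n >= 1
 TopDocs hits = searcher.search(qu, total);       // second pass: fetch them all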


Re: SOLR Cloud - Disable Transaction Logs

2013-06-19 Thread Erick Erickson
Right, NRT is not tied to cloud, but it is tied to the update log.

And you bring up an interesting issue when you talk about availability zones.
SolrCloud is fairly chatty in that all of the nodes need to talk to all the
other nodes in the network and they will. If the nodes are separated by
an expensive connection (however you measure expensive - latency,
cost to use, or whatever) then this may well be a bottleneck. For instance,
the leader needs to talk to every one of its followers for an update. Imagine
a leader in zone1 and all 15 replicas in zone2. Now the expensive pipe
will be used 15 times to send the update.

Same for queries, there's an internal software load balancer that sends
queries to one node in each shard with no control over what zone it's
in.

The same argument applies to separate physical data centers FWIW.

We're largely speculating that this may lead to bottlenecks, but it's
something to keep in mind. There are thoughts about making SolrCloud
rack aware in a way that will ameliorate this, but nobody has had
time to work on this yet.

We'd _love_ to hear about any real-life experience in this area!

Best
Erick

On Tue, Jun 18, 2013 at 4:37 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 Erick,

 We at AOL mail have been using SOLR for quite a while and our system is 
 pretty write heavy and disk I/O is one of our bottlenecks. At present we use 
 regular SOLR in the lotsOfCores configuration and I am in the process of 
 benchmarking SOLR cloud for our use case. I don't have concrete data that 
 tLogs are placing a lot of load on the system, but for a large scale system 
 like ours even minimal load gets magnified.


 From the Cloud design, for a properly set up cluster, usually you have 
 replicas in different availability zones. The probability of losing more than 1 
 availability zone at any given time should be pretty low. Why have tLogs if 
 all replicas get the request on an update anyway? In theory 1 replica must be 
 able to commit eventually.

 NRT is an optional feature and probably not tied to Cloud, correct?


 Thanks,

 Rishi.







 -Original Message-
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tue, Jun 18, 2013 4:07 pm
 Subject: Re: SOLR Cloud - Disable Transaction Logs


 bq: the replica can take over and maintain a durable
 state of my index

 This is not true. On an update, all the nodes in a slice
 have already written the data to the tlog, not just the
 leader. So if a leader goes down, the replicas have
 enough local info to insure that data is not lost. Without
 tlogs this would not be true since documents are not
 durably saved until a hard commit.

 tlogs save data between hard commits. As Yonik
 explained to me once, soft commits are about
 visibility, hard commits are about durability and
 tlogs fill up the gap between hard commits.

 So to reinforce Shalin's comment: yes, you can disable tlogs
 if
 1> you don't want any of SolrCloud's HA/DR capabilities
 2> NRT is unimportant

 IOW if you're using 4.x just like you would 3.x in terms
 of replication, HA/DR, etc. This is perfectly reasonable,
 but don't get hung up on disabling tlogs.

 And you haven't told us _why_ you want to do this. They
 don't consume much memory or disk space unless you
 have configured your hard commits (with openSearcher
 true or false) to be quite long. Do you have any proof at
 all that the tlogs are placing enough load on the system
 to go down this road?

 Best
 Erick

 On Tue, Jun 18, 2013 at 10:49 AM, Rishi Easwaran rishi.easwa...@aol.com 
 wrote:
 SolrJ already has access to zookeeper cluster state. Network I/O bottleneck
 can be avoided by parallel requests.
 You are only as slow as your slowest responding server, which could be your
 single leader with the current set up.

 Wouldn't this lessen the burden of the leader, as he does not have to 
 maintain
 transaction logs or distribute to replicas?







 -Original Message-
 From: Shalin Shekhar Mangar shalinman...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tue, Jun 18, 2013 2:05 am
 Subject: Re: SOLR Cloud - Disable Transaction Logs


 Yes, but at what cost? You are thinking of replacing disk IO with even
 slower network IO. The transaction log is an append-only log -- it is
 pretty cheap, especially if you compare it with the indexing process.
 Plus your write requests/sec will drop a lot once you start doing
 synchronous replication.


 On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran 
 rishi.easwa...@aol.comwrote:

 Shalin,

 Just some thoughts.

 Near Real Time replication - don't we use SolrCmdDistributor, which sends
 requests immediately to replicas with a cloned request? As an option, can't
 we achieve something similar from CloudSolrServer in SolrJ instead of the
 leader doing it? As long as 2 nodes receive writes and acknowledge,
 durability should be high.
 Peer-Sync and Recovery - Can we achieve that merging indexes from leader
 as needed, 

Re: Solr cloud: zkHost in solr.xml gets wiped out

2013-06-19 Thread Erick Erickson
Thanks for the confirmation! I was wondering where these bits came from
 wt=javabin version=2
since I wasn't seeing them, but you mentioned SolrCloud, so that
explains things.

It'll be tonight before I commit the fix I'm afraid, I'm traveling and
need to put one more test in.

Best
Erick

On Tue, Jun 18, 2013 at 5:47 PM, Al Wold alw...@alwold.com wrote:
 I just finished a test with the patch, and it looks like all is working well.

 On Jun 18, 2013, at 12:19 PM, Al Wold wrote:

 For the CREATE call, I'm doing it manually per the instructions here:

 http://wiki.apache.org/solr/SolrCloud

 Here's the exact URL I'm using:

 http://asu-solr-cloud.elasticbeanstalk.com/admin/collections?action=CREATE&name=directory&numShards=2&replicationFactor=2&maxShardsPerNode=2

 I'm testing out your patch now, and I'll let you know how it goes.

 Thanks for all the help!

 -Al

 On Jun 18, 2013, at 6:47 AM, Erick Erickson wrote:

 OK, I think I see what's happening. If you do
 NOT specify an instanceDir on the create
 (and I'm doing this via the core admin
 interface, not SolrJ) then the default is
 used, but not persisted. If you _do_
 specify the instance dir, it will be persisted.

 I've put up another quick patch (tested
 only in my test case, running full suite
 now). Can you give it a whirl? You'll have
 to apply the patch over top of the current
 4x, een though the patch is for trunk it
 applied to 4x cleanly for me and the tests ran.

 Thanks,
 Erick

 On Tue, Jun 18, 2013 at 9:02 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
 OK, I put up a very preliminary patch attached to the bug
 if you want to try it out that addresses the extra junk being
 put in the core tag. Doesn't address the instanceDir issue
 since I haven't reproduced it yet.

 Erick

 On Tue, Jun 18, 2013 at 8:46 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
 Whoa! What's this junk?
  qt="/admin/cores" wt="javabin" version="2"

 That shouldn't be being preserved, and the instancedir should be!

 So I'm guessing you're using SolrJ to create the core, but I just
 reproduced the problem (at least the 'wt=json ') bit from the
 browser and even from one of my internal tests when I added
 extra parameters.

 That said, instanceDir is being preserved in my test, so I'm not
 seeing everything you're seeing, could you cut/paste your
 create code? I'll see if I can set up a test case for SolrJ to catch
 this too.

 See SOLR-4935

 Thanks for reporting!

 On Mon, Jun 17, 2013 at 5:39 PM, Al Wold alw...@alwold.com wrote:
 Hi Erick,
 I tried out your changes from the branch_4x branch. It looks good in 
 terms of preserving the zkHost, but I'm running into an exception 
 because it isn't persisting the instanceDir attribute on the core 
 element.

 I've got a few other things I need to take care of, but as soon as I 
 have time I'll dig in and see if I can figure out what's going on, and 
 see what changed to make this not work.

 Here are details on what the files looked like before/after CREATE call:

 original solr.xml:

  <?xml version="1.0" encoding="UTF-8" ?>
  <solr persistent="true" sharedLib="lib" zkHost="10.116.249.136:2181">
    <!-- this 8080 might need to change in production -->
    <cores adminPath="/admin/cores" zkClientTimeout="2" hostPort="8080"
           hostContext="/"/>
  </solr>

 here's what was produced with 4.3 branch + a quick mod to preserve 
 zkHost:

  <?xml version="1.0" encoding="UTF-8" ?>
  <solr persistent="true" zkHost="10.116.249.136:2181" sharedLib="lib">
    <cores adminPath="/admin/cores" zkClientTimeout="2" hostPort="8080"
           hostContext="/">
      <core loadOnStartup="true" shard="shard1"
            instanceDir="directory_shard1_replica1/" transient="false"
            name="directory_shard1_replica1" collection="directory"/>
      <core loadOnStartup="true" shard="shard2"
            instanceDir="directory_shard2_replica1/" transient="false"
            name="directory_shard2_replica1" collection="directory"/>
    </cores>
  </solr>

 here's what was produced with branch_4x 4.4-SNAPSHOT:

  <?xml version="1.0" encoding="UTF-8" ?>
  <solr persistent="true" zkHost="10.116.249.136:2181" sharedLib="lib">
    <cores adminPath="/admin/cores" zkClientTimeout="2"
           distribUpdateSoTimeout="0" distribUpdateConnTimeout="0" hostPort="8080"
           hostContext="/">
      <core shard="shard1" numShards="2" name="directory_shard1_replica2"
            collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
      <core shard="shard2" numShards="2" name="directory_shard2_replica2"
            collection="directory" qt="/admin/cores" wt="javabin" version="2"/>
    </cores>
  </solr>

 and here's the error from solr.log after restarting after the CREATE:

 2013-06-17 21:37:07,083 1874 [pool-2-thread-1] ERROR 
 org.apache.solr.core.CoreContainer  - 
 null:java.lang.NullPointerException: Missing required 'instanceDir'
   at 
 org.apache.solr.core.CoreDescriptor.doInit(CoreDescriptor.java:133)
   at 
 org.apache.solr.core.CoreDescriptor.<init>(CoreDescriptor.java:87)
   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:365)
   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:221)
   at 
 

Re: How to define my data in schema.xml

2013-06-19 Thread Mysurf Mail
Well,
Avoiding flattening the DB to a flat table sounds like a great plan.
I found this solution:
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example

It imports via a join, without handling a flat table.



On Tue, Jun 18, 2013 at 5:53 PM, Jack Krupansky j...@basetechnology.comwrote:

 You can in fact have multiple collections in Solr and do a limited amount
 of joining, and Solr has multivalued fields as well, but none of those
 techniques should be used to avoid the process of flattening and
 denormalizing a relational data model. It is hard work, but yes, it is
 required to use Solr effectively.

 Again, start with the queries - what problem are you trying to solve.
 Nobody stores data just for the sake of storing it - how will the data be
 used?


 -- Jack Krupansky

 -Original Message- From: Mysurf Mail
 Sent: Tuesday, June 18, 2013 9:58 AM

 To: solr-user@lucene.apache.org
 Subject: Re: How to define my data in schema.xml

 Hi Jack,
 Thanks, for you kind comment.

 I am truly in the beginning of data modeling my schema over an existing
 working DB.
 I have used the school-teachers-student db as an example scenario.
 (a, I have written it as a disclaimer in my first post. b. I really do not
 know anyone that has 300 hobbies too.)

 In real life my db is obviously much different,
 I just used this as an example of potential pitfalls that will occur if I
 use my old db data modeling notions.
 obviously, the old relational modeling idioms do not apply here.

 Now, my question was referring to the fact that I would really like to
 avoid a flat table/join/view because of the reason listed above.
 So, my scenario is answering a plain user generated text search over a
 MSSQLDB that contains a few 1:n relation (and a few 1:n:n relationship).

 So, I come here for tips. Should I use one combined index (treat it as a
 nosql source) or separate indices or another. any other ways to define
 relation data ?
 Thanks.



 On Tue, Jun 18, 2013 at 4:30 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  It sounds like you still have a lot of work to do on your data model. No
 matter how you slice it, 8 billion rows/fields/whatever is still way too
 much for any engine to search on a single server. If you have 8 billion of
 anything, a heavily sharded SolrCloud cluster is probably warranted. Don't
 plan ahead to put more than 100 million rows on a single node; plan on a
 proof of concept implementation to determine that number.

 When we in Solr land say flattened or denormalized, we mean in an
 intelligent, smart, thoughtful sense, not a mindless, mechanical
 flattening. It is an opportunity for you to reconsider your data models,
 both old and new.

 Maybe data modeling is beyond your skill set. If so, have a chat with your
 boss and ask for some assistance, training, whatever.

 Actually, I am suspicious of your 8 billion number - change each of those
 300's to realistic, average numbers. Each teacher teaches 300 courses?
 Right. Each Student has 300 hobbies? If you say so, but...

 Don't worry about schema.xml until you get your data model under control.

 For an initial focus, try envisioning the use cases for user queries. That
 will guide you in thinking about how the data would need to be organized
 to
 satisfy those user queries.

 -- Jack Krupansky

 -Original Message- From: Mysurf Mail
 Sent: Tuesday, June 18, 2013 2:20 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to define my data in schema.xml


 Thanks for your reply.
 I have tried the simplest approach and it works absolutely fantastic.
 Huge table - 0s to result.

 two problems as I described earlier, and that is what I try to solve:
 1. I create a flat table just for solar. This requires maintenance and
 develop. Can I run solr over my regular tables?
This is my simplest approach. Working over my relational tables,
 2. When you query a flat table by school name, as I described, if the
 school has 300 student, 300 teachers, 300  with 300 teacherCourses, 300
 studentHobbies,
you get 8.1 Billion rows (300*300*300*300). As I am sure this will work
 great on solar - searching for the school name will retrieve 8.1 B rows.
 3. Lets say all my searches are user generated free text search that is
 searching name and comments columns.
 Thanks.


 On Tue, Jun 18, 2013 at 7:32 AM, Gora Mohanty g...@mimirtech.com wrote:

  On 18 June 2013 01:10, Mysurf Mail stammail...@gmail.com wrote:

  Thanks for your quick reply. Here are some notes:
 
  1. Consider that all tables in my example have two columns: Name 
  Description which I would like to index and search.
  2. I have no other reason to create flat table other than for solar. So
  I
  would like to see if I can avoid it.
  3. If in my example I will have a flat table then obviously it will 
 hold
 a
  lot of rows for a single school.
  By searching the exact school name I will likely receive a lot of
 rows.
  (my flat table has its own pk)

 Yes, all of this is 

Re: PostingsSolrHighlighter not working on Multivalue field

2013-06-19 Thread Erick Erickson
Well, _how_ does it fail? Unless it's a typo, it should be
multiValued (note capital 'V'). This probably isn't the
problem, but just in case.

Anything in the logs? What is the field definition?
Did you re-index after changing to multiValued?

Best
Erick

On Tue, Jun 18, 2013 at 11:01 PM, Floyd Wu floyd...@gmail.com wrote:
 In my test case, it seems this new highlighter not working.

 When field set multivalue=true, the stored text in this field can not be
 highlighted.

 Am I miss something? Or this is current limitation? I have no luck to find
 any documentations mentioned this.

 Floyd


Re: Solr string field stripping new lines line breaks

2013-06-19 Thread Erick Erickson
First, please start a new thread when you change the topic,
doing so makes the threads easier to track.

But what is your evidence that line breaks are stripped? The
stored data is a verbatim copy of the data that went in to the
field, nothing at all is changed. So one of several things is
happening:
1> they may be being stripped by whatever turns the PDF into
a Solr document, SOLARIUM?
2> if you're displaying them in a browser, the line breaks may be
there but just being ignored by the browser.

You could write a very brief SolrJ program or similar and see the
raw output by getting the data directly from your index...
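
Something along these lines (SolrJ 4.x, with an illustrative core URL, document id and field name) is enough to check whether the stored value still contains the newlines:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CheckLineBreaks {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("id:SOME_DOC_ID");    // illustrative id
        q.setFields("content");                           // illustrative stored field
        QueryResponse rsp = server.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            String text = (String) doc.getFirstValue("content");
            System.out.println(text != null && text.contains("\n")
                    ? "line breaks ARE stored" : "no line breaks in stored value");
        }
        server.shutdown();
    }
}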

Best
Erick

On Wed, Jun 19, 2013 at 5:50 AM, sodoo first...@yahoo.com wrote:
 Dears,

 My english is bad. But I will try to explain.

 I have indexed databases and files. The files included : docx, pdf, txt.
 Then I have indexed all of data.
 But my indexed document  pdf files text all of through continued.

 I try to appear line break text.
 Document files text line breaks to indexed document also line breaks.

 My frontend app is SOLARIUM.

 How can I appear line break the indexed data?
 Please assist me on this.

 Thank you



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-string-field-stripping-new-lines-line-breaks-tp3984384p4071595.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Sharding and Replication

2013-06-19 Thread Asif
Hi,

I had questions on implementation of Sharding and Replication features of
Solr/Cloud.

1. I noticed that when sharding is enabled for a collection - individual
requests are sent to each node serving as a shard.

2. Replication too follows above strategy of sending individual documents
to the nodes serving as a replica.

I am working with a system that requires a massive number of writes - I have
noticed that due to the above reason the cloud eventually starts to fail
(even though I am using an ensemble).

I do understand the reason behind individual updates - but why not batch
them up, or give an option to batch N updates, in either of the above cases? I
did come across a presentation that talked about batching 10 updates for
replication at least, but I do not think this is the case.
- Asif


Re: Solr Suggest does not work in solrcloud environment

2013-06-19 Thread Aloke Ghoshal
Hi,

 Check the obvious first, that you have rebuilt & reloaded the suggest
dictionary individually on all nodes. Also the other checks here:
http://stackoverflow.com/questions/6653186/solr-suggester-not-returning-any-results

Then, try with either the query component OR the distrib=false setting:
http://lucene.472066.n3.nabble.com/SolrCloud-vs-distributed-suggester-td4041859.html

Your suggester entry seems a little bloated. Check if the commented
portions are needed:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <str name="queryAnalyzerFieldType">suggest_text</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest</str>
    <float name="threshold">0</float>
    <str name="buildOnCommit">true</str>
  </lst>
  <!-- !!! Do you need these? !!!
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">suggest</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.2</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">50</int>
    <int name="minQueryLength">2</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">suggest</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
  -->
</searchComponent>


<requestHandler class="org.apache.solr.handler.component.SearchHandler"
                name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <!-- !!! Do you need these? !!!
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    -->
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
    <!-- !!! Add the query component here !!! -->
    <str>query</str>
  </arr>
</requestHandler>


Regards,
Aloke



On Wed, Jun 19, 2013 at 2:33 PM, Sharp s.sh...@infovations.ch wrote:

 Hi Guys

 I am having difficulties running a suggest Search Handler in a solrcloud
 environment. The configuration was tested on a standalone machine and works
 fine there.

 Here is my configuration:

 *Schema.xml*

 field name=suggest type=suggest_text indexed=true stored=false
 multiValued=true /

 copyField source=field1 dest=suggest /
 copyField source=field2 dest=suggest /
 copyField source=field3 dest=suggest /
 ...

 fieldType name=suggest_text class=solr.TextField
 positionIncrementGap=100 autoGeneratePhraseQueries=true
 analyzer type=index
 tokenizer class=solr.KeywordTokenizerFactory /
 filter class=solr.SynonymFilterFactory
 synonyms=synonym.txt
 ignoreCase=true
 expand=true /
 filter class=solr.StopFilterFactory
 ignoreCase=true
 words=stopword.txt
 enablePositionIncrements=true /
 filter class=solr.LowerCaseFilterFactory /
 filter class=solr.KeywordMarkerFilterFactory
 protected=protword.txt
 /
 /analyzer
 analyzer type=query
 tokenizer class=solr.KeywordTokenizerFactory /
 filter class=solr.StopFilterFactory
 ignoreCase=true
 words=stopword.txt
 enablePositionIncrements=true /
 filter class=solr.LowerCaseFilterFactory /
 filter class=solr.KeywordMarkerFilterFactory
 protected=protword.txt
 /
 /analyzer
 /fieldType


 *Solrconfig.xml*

 searchComponent class=solr.SpellCheckComponent name=suggest
 str name=queryAnalyzerFieldTypesuggest_text/str
 lst name=spellchecker
 str name=namesuggest/str
 str
 name=classnameorg.apache.solr.spelling.suggest.Suggester/str
 str
 name=lookupImplorg.apache.solr.spelling.suggest.tst.TSTLookup/str
 str name=fieldsuggest/str
 float name=threshold0/float
 str name=buildOnCommittrue/str
 /lst
 lst name=spellchecker
 str name=namedefault/str
 str name=fieldsuggest/str
 str 

Re: UnInverted multi-valued field

2013-06-19 Thread Jack Krupansky

Take a look at using DocValues for faceted fields.

-- Jack Krupansky

-Original Message- 
From: Jochen Lienhard

Sent: Wednesday, June 19, 2013 5:30 AM
To: solr-user@lucene.apache.org
Subject: UnInverted multi-valued field

Hi @all.

We have the problem that after an update the index takes to much time
for 'warm up'.

We have some multivalued facet-fields and during the startup solr
creates the messages:

INFO: UnInverted multi-valued field
{field=mt_facet,memSize=18753256,tindexSize=54,time=170,phase1=156,nTerms=17,bigTerms=3,termInstances=903276,uses=0}


In the solconfig we use the facet.method 'fc'.
We know, that the start-up with the method 'enum' is faster, but then
the searches are very slow.

How do you handle this problem?
Or have you any idea for optimizing the warm up?
Or what do you do after an update?

Greetings

Jochen

--
Dr. rer. nat. Jochen Lienhard
Dezernat EDV

Albert-Ludwigs-Universität Freiburg
Universitätsbibliothek
Rempartstr. 10-16  | Postfach 1629
79098 Freiburg | 79016 Freiburg

Telefon: +49 761 203-3908
E-Mail: lienh...@ub.uni-freiburg.de
Internet: www.ub.uni-freiburg.de 



Re: Disable Replication for all Cores in a single Command

2013-06-19 Thread Shawn Heisey
On 6/19/2013 2:18 AM, Ralf Heyde wrote:
 Hello Folks,
 
 is it possible to disable the replication for ALL cores using one
 command? We currently use Solr 3.6.
 
 Currently we have a curl operation, which fires:
 http://slave_host:port/solr/core/admin/replication/index.jsp?poll=disable
 
 
 In the documentation there is a URL-Command which seems to be correct,
 but it says 404.
 http://slave_host:port/solr/replication?command=disablepoll

I don't think there is a way to do this, because each Solr core is
self-contained and its configuration is independent of the others.

The URL that you have shown that doesn't include the core name will only
work in a multicore environment if the defaultCoreName attribute is
found in solr.xml, and will only access the specific core that is named
there.  I know this attribute works in Solr 4.x, but I don't know if it
worked in 3.x.  I have never used it.

It might actually make sense to add one or more actions to the CoreAdmin
for this, but I'm fairly sure that the feature doesn't currently exist.
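
Until then it is easy enough to script from a client. A rough, untested sketch (SolrJ 4.x plus plain HTTP; the host and port are illustrative) that asks CoreAdmin for the core names and then hits each core's replication handler:

import java.io.InputStream;
import java.net.URL;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;

public class DisablePollingEverywhere {
    public static void main(String[] args) throws Exception {
        String base = "http://slave_host:8983/solr";                        // illustrative
        HttpSolrServer admin = new HttpSolrServer(base);
        CoreAdminResponse status = CoreAdminRequest.getStatus(null, admin); // null = all cores
        for (int i = 0; i < status.getCoreStatus().size(); i++) {
            String core = status.getCoreStatus().getName(i);
            // fire the same per-core command the curl call above uses
            URL url = new URL(base + "/" + core + "/replication?command=disablepoll");
            InputStream in = url.openStream();
            in.close();
            System.out.println("disabled polling on " + core);
        }
        admin.shutdown();
    }
}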

Thanks,
Shawn



Re: Solr Cloud Hangs consistently .

2013-06-19 Thread Rishi Easwaran
Update!!

Got SOLR Cloud working - I was able to do 90k document inserts with 
replicationFactor=2 with my jmeter script; previously it was getting stuck at 3k 
inserts or less.
After some investigation, I figured out that the ulimits for my process were not 
being set properly; the OS defaults were kicking in, which are very small for a 
server app.
One of our install scripts had changed.
I had to raise the ulimits (-n, -u, -v) and for now no other issues are seen.


 

 

-Original Message-
From: Rishi Easwaran rishi.easwa...@aol.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Jun 18, 2013 10:40 am
Subject: Re: Solr Cloud Hangs consistently .


Mark,

All I am doing are inserts, afaik search side deadlocks should not be an issue.

I am using Jmeter, standard test driver we use for most of our benchmarks and 
stats collection.
My jmeter.jmx file- http://apaste.info/79IS , maybe i overlooked something

 
Is there a benchmark script that solr community uses (preferably with jmeter), 
we are write heavy so at the moment focusing on inserts only.

Thanks,

Rishi.

 

 

-Original Message-
From: Yago Riveiro yago.rive...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jun 17, 2013 6:19 pm
Subject: Re: Solr Cloud Hangs consistently .


I do all the indexing through a HTTP POST, with replicationFactor=1 no problem, 
if is higher deadlock problems can appear

A stack trace like this 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862
 

is that I get

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, June 17, 2013 at 11:03 PM, Mark Miller wrote:

 If it actually happens with replicationFactor=1, it doesn't likely have 
anything to do with the update handler issue I'm referring to. In some cases 
like these, people have better luck with Jetty than Tomcat - we test it much 
more. For instance, it's setup to help avoid search side distributed deadlocks.
 
 In any case, there is something special about it - I do and have seen a lot 
 of 

heavy indexing to SolrCloud by me and others without running into this. Both 
with replicationFacotor=1 and greater. So there is something specific in how 
the 

load is being done or what features/methods are being used that likely causes 
it 

or makes it easier to cause.
 
 But again, the issue I know about involves threads that are not even created 
in the replicationFactor = 1 case, so that could be a first report afaik.
 
 - Mark
 
 On Jun 17, 2013, at 5:52 PM, Rishi Easwaran rishi.easwa...@aol.com 
(mailto:rishi.easwa...@aol.com) wrote:
 
  Update!!
  
  This happens with replicationFactor=1
  Just for kicks I created a collection with a 24 shards, replicationfactor=1 
cluster on my exisiting benchmark env.
  Same behaviour, SOLR cloud just hangs. Nothing in the logs, top/heap/cpu 
most metrics looks fine.
  Only indication seems to be netstat showing incoming request not being read 
in.
  
  Yago,
  
  I saw your previous post 
  (http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631)
  Following it, Last week, I upgraded to SOLR 4.3, to see if the issue gets 
fixed, but no luck.
  Looks like this is a dominant and easily reproducible issue on SOLR cloud.
  
  
  Thanks,
  
  Rishi. 
  
  
  
  
  
  
  
  
  
  
  
  -Original Message-
  From: Yago Riveiro yago.rive...@gmail.com (mailto:yago.rive...@gmail.com)
  To: solr-user solr-user@lucene.apache.org 
  (mailto:solr-user@lucene.apache.org)
  Sent: Mon, Jun 17, 2013 5:15 pm
  Subject: Re: Solr Cloud Hangs consistently .
  
  
  I can confirm that the deadlock happen with only 2 replicas by shard. I 
  need 


  shutdown one node that host a replica of the shard to recover the 
  indexation 


  capability.
  
  -- 
  Yago Riveiro
  Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
  On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote:
  
   
   
   Hi All,
   
   I am trying to benchmark SOLR Cloud and it consistently hangs. 
   Nothing in the logs, no stack trace, no errors, no warnings, just seems 
stuck.
   
   A little bit about my set up. 
   I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each 
host 
   
  
  is configured to have 8 SOLR cloud nodes running at 4GB each.
   JVM configs: http://apaste.info/57Ai
   
   My cluster has 12 shards with replication factor 2- 
   http://apaste.info/09sA
   
   I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
  running this configuration in production in Non-Cloud form. 
   It got stuck repeatedly.
   
   I decided to upgrade to the latest and greatest of everything, SOLR 4.3, 
JDK7 
  and tomcat7. 
   It still shows same behaviour and hangs through the test.
   
   My test schema and config.
   Schema.xml - http://apaste.info/imah
   SolrConfig.xml - http://apaste.info/ku4F
   
   The test is pretty simple. its a jmeter test with update command via SOAP 
rpc 
  (round robin 

Highlighting using hl.q without a df field

2013-06-19 Thread AdamP
Is it possible to use the hl.q field if you’re using the extended dismax
query parser and have defined the “qf” field, but not a “df” field?  

Here’s a sample query:  

q=drive&fq=cat:electronics&hl=true&hl.fl=cat,name&hl.q=drive cat:electronics

In this case I want to highlight the facet "electronics" and the word
"drive" within the cat and name fields.  Assuming I'm understanding the wiki
correctly, snippets should be generated for the hl.fl fields.  What I'm
getting is an error message saying "no field name specified in query and no
default specified via 'df' param".  If I remove the word "drive" from the
hl.q field, it works correctly, which makes sense given the error.  I
just don't understand why it's not using the "qf" or "hl.fl" fields to query
against.

Thanks 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highlighting-using-hl-q-without-a-df-field-tp4071648.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: UnInverted multi-valued field

2013-06-19 Thread Toke Eskildsen
On Wed, 2013-06-19 at 11:30 +0200, Jochen Lienhard wrote:
 INFO: UnInverted multi-valued field 
 {field=mt_facet,memSize=18753256,tindexSize=54,time=170,phase1=156,nTerms=17,bigTerms=3,termInstances=903276,uses=0}

170ms does not sound like much to me. What are you hoping for?

 We know, that the start-up with the method 'enum' is faster, but then
 the searches are very slow.

That is a bit strange. With only 17 terms, enum should be quite fast.
How much do the two methods differ in speed?

- Toke Eskildsen




Re: yet another optimize question

2013-06-19 Thread Andre Bois-Crettez

indeed the actual syntax for per field facet is :

f.mysparefieldname.facet.method=enum

André

On 06/18/2013 09:00 PM, Petersen, Robert wrote:

Hi Andre,

Wow that is astonishing!  I will definitely also try that out!  Just set the 
facet method on a per field basis for the less used sparse facet fields eh?  
Thanks for the tip.

Thanks
Robi

-Original Message-
From: Andre Bois-Crettez [mailto:andre.b...@kelkoo.com]
Sent: Tuesday, June 18, 2013 3:03 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Recently we had steadily increasing memory usage and OOM due to facets on 
dynamic fields.
The default facet.method=fc need to build a large array of maxdocs ints for 
each field (a fieldCache or fieldValueCahe entry), whether it is sparsely 
populated or not.

Once you have reduced your number of maxDocs with the merge policy, it can be 
interesting to try facet.method=enum for all the sparsely populated dynamic 
fields.
Despite what is said in the wiki, in our case the performance was similar to 
facet.method=fc, however the JVM heap usage went down from about 20GB to 4GB.

André

On 06/17/2013 08:21 PM, Petersen, Robert wrote:

Also some time ago I made all our caches small enough to keep us from getting 
OOMs while still having a good hit rate.Our index has about 50 fields which 
are mostly int IDs and there are some dynamic fields also.  These dynamic 
fields can be used for custom faceting.  We have some standard facets we always 
facet on and other dynamic facets which are only used if the query is filtering 
on a particular category.  There are hundreds of these fields but since they 
are only for a small subset of the overall index they are very sparsely 
populated with regard to the overall index.

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Solr Suggest does not work in solrcloud environment

2013-06-19 Thread Sharp
Hi Aloke

Thanks for your reply. It works with the 

http://url.com:8983/solr/mycore/suggest?q=bar&wt=json&distrib=false

parameter or when inserted into the defaults

<requestHandler class="org.apache.solr.handler.component.SearchHandler"
                name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">true</str>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

I use the bootstrap parameter at startup, so the configuration is deployed to
all other servers. The query component just creates additional output but
nothing useful.

<arr name="components">
  <str>suggest</str>
  <str>query</str>
</arr>


So why is the additional parameter necessary? I would assume that Solr takes
care of it internally. I have only configured one shard. 

But thanks anyway. It works as a workaround so far.

Simon



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Suggest-does-not-work-in-solrcloud-environment-tp4071587p4071660.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to dynamically add geo fields to a query using a request handler

2013-06-19 Thread ade-b
Hi

We have a request handler defined in solrconfig.xml that specifies a list of
fields to return for the request using the fl name.

E.g. <str name="fl">createdDate</str>

When constructing a query using SolrJ that uses this request handler, we
want to conditionally add the geospatial fields that will tell us the
distance of a record in the Solr index from a given location. Currently we
add this to the query by specifying 

solrQuery.set("fl", "*,distance:geodist()");

This has the effect of returning all fields for the record - not those
specified in the request handler. I'm assuming this is because the * in
the solrQuery.set call overrides the fields statically defined in the
request handler.

I have tried to add the geodist property via the solrQuery.addField()
method, but that complains saying it is not a valid field - maybe I used it
incorrectly?

Has anybody any ideas how to achieve this?

Thanks
Ade
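
One workaround sketch (SolrJ; the handler name, spatial field and point below are illustrative assumptions): repeat the handler's field list explicitly instead of '*', add the distance pseudo-field, and remember that geodist() also needs sfield and pt:

SolrQuery solrQuery = new SolrQuery("*:*");
solrQuery.setRequestHandler("/myhandler");              // illustrative handler name
// repeat the handler's field list instead of '*', plus the distance pseudo-field
solrQuery.set("fl", "createdDate,distance:geodist()");
solrQuery.set("sfield", "location");                    // illustrative spatial field name
solrQuery.set("pt", "52.52,13.40");                     // illustrative reference point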






--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-dynamically-add-geo-fields-to-a-query-using-a-request-handler-tp4071655.html
Sent from the Solr - User mailing list archive at Nabble.com.


another transaction log + commit question

2013-06-19 Thread Joshi, Shital
Hi,

We hard committed (/update/csv?commit=true) about 20,000 documents to 
SolrCloud (5 shards, 1 replica each = 10 JVM instances). We have commented out both 
autoCommit and autoSoftCommit settings in solrconfig.xml. What we noticed is that 
the transaction log size never goes down to 0. We thought that once the fsync to 
all replicas etc. finishes, the trans log should get deleted since everything is 
persisted. We restarted the cloud a couple of times but the trans log was always bigger 
than the size of the index for that shard. Why is that?

1.9M$HOME/solr_data/solr1
3.0M$HOME/solr_data/solr1_tranlog
2.2M$HOME/solr_data/solr2
3.0M$HOME/solr_data/solr2_tranlog


If we have commented out the autoCommit setting from solrconfig.xml and we hard 
commit say 20K documents every 10 minutes, when will a new searcher get 
created? Without the autoCommit setting, what is the default behavior of the new 
searcher?

One last question: does a new searcher get created and do all caches get 
refreshed on every soft commit? Or does Solr update the existing searcher with 
what changed during the last soft commit?

Many Thanks!



Re: UnInverted multi-valued field

2013-06-19 Thread Roman Chyla
On Wed, Jun 19, 2013 at 5:30 AM, Jochen Lienhard 
lienh...@ub.uni-freiburg.de wrote:

 Hi @all.

 We have the problem that after an update the index takes to much time for
 'warm up'.

 We have some multivalued facet-fields and during the startup solr creates
 the messages:

 INFO: UnInverted multi-valued field {field=mt_facet,memSize=18753256,
 tindexSize=54,time=170,phase1=156,nTerms=17,bigTerms=3,termInstances=903276,uses=0}


 In the solconfig we use the facet.method 'fc'.
 We know, that the start-up with the method 'enum' is faster, but then the
 searches are very slow.

 How do you handle this problem?
 Or have you any idea for optimizing the warm up?
 Or what do you do after an update?


You probably know, but just in case... you may use autowarming; the new
searcher will populate the cache, and only after the warmup queries have
finished will it be exposed to the world. The old searcher continues to
handle requests in the meantime.

roman



 Greetings

 Jochen

 --
 Dr. rer. nat. Jochen Lienhard
 Dezernat EDV

 Albert-Ludwigs-Universität Freiburg
 Universitätsbibliothek
 Rempartstr. 10-16  | Postfach 1629
 79098 Freiburg | 79016 Freiburg

 Telefon: +49 761 203-3908
 E-Mail: lienh...@ub.uni-freiburg.de
 Internet: www.ub.uni-freiburg.de




Re: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Michael McCandless
The default is 2.0, and higher values will more strongly favor merging
segments with deletes.

I think 20.0 is likely way too high ... maybe try 3-5?


Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert
robert.peter...@mail.rakuten.com wrote:
 Hi

 In continuing a previous conversation, I am attempting to not have to do 
 optimizes on our continuously updated index in solr3.6.1 and I came across 
 the mention of the reclaimDeletesWeight setting in this blog: 
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

 We do a *lot* of deletes in our index so I want to make the merges be more 
 aggressive on reclaiming deletes, but I am having trouble finding much out 
 about this setting.  Does anyone have experience with this setting?  Would 
 the below accomplish what I want ie for it to go after deletes more 
 aggressively than normal?  I got the impression 10.0 was the default from 
 looking at this code but I could be wrong:
 https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">20</int>
   <int name="segmentsPerTier">8</int>
   <double name="reclaimDeletesWeight">20.0</double>
 </mergePolicy>

 Thanks

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department



Re: Question about SOLR search relevance score

2013-06-19 Thread Gora Mohanty
On 19 June 2013 21:15, sérgio Alves sd_t_al...@hotmail.com wrote:
[...]
 Right now we're having problems with some common search terms. They
 return varied results on the search results, and the products which
 should appear first in the results, are scored lower than other,
 seemingly unrelated, products.
[...]
 I wanted to know if there is a parameter or any possible way for me to
 know the way that solr calculates the scores it returns. For example, if
  we had a search relevancy formula like
 QF=attributes_name^15+attributes_brand^10+attributes_category^8, how can
  I know that brand scored 'x', for
  name 'y' and category 'z'. Is that possible? How can I do that?
[...]

To get an explanation of the scoring, add debugQuery=on
as a parameter to your Solr search URL. Please see
http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
There are also various 'explain' parameters that might be
useful.

I take it that you have already seen
http://wiki.apache.org/solr/SolrRelevancyFAQ

Regards,
Gora


Question about SOLR search relevance score

2013-06-19 Thread sérgio Alves
Hi.





My name is Sérgio Alves and I'm a developer in a project that uses solr as its 
search engine.





Right now we're having problems with some common search terms. They 
return varied results on the search results, and the products which 
should appear first in the results, are scored lower than other, 
seemingly unrelated, products.





I wanted to know if there is a parameter or any possible way for me to 
know the way that solr calculates the scores it returns. For example, if
 we had a search relevancy formula like 
QF=attributes_name^15+attributes_brand^10+attributes_category^8, how can
 I know that brand scored 'x', for
 name 'y' and category 'z'. Is that possible? How can I do that?





This is urgent, if someone could take the time and answer this 
topic to me in a quick manner, I would really appreciate it.





Thank you very much for the attention, best regards,


Sérgio Alves  

RE: Question about SOLR search relevance score

2013-06-19 Thread Swati Swoboda
Hi Sergio,

Append 'debugQuery=on' to your queries to learn more about how your queries 
are being evaluated/ranked.

i.e. 
qf=attributes_name^15+attributes_brand^10+attributes_category^8&debugQuery=on

You'll get an XML section that is dedicated to debug information.

I've found http://explain.solr.pl/ useful in understanding and visualizing the 
debug output.

Swati

-Original Message-
From: sérgio Alves [mailto:sd_t_al...@hotmail.com] 
Sent: Wednesday, June 19, 2013 11:45 AM
To: solr-user@lucene.apache.org
Subject: Question about SOLR search relevance score

Hi.





My name is Sérgio Alves and I'm a developer in a project that uses solr as its 
search engine.





Right now we're having problems with some common search terms. They 
return varied results on the search results, and the products which 
should appear first in the results, are scored lower than other, 
seemingly unrelated, products.





I wanted to know if there is a parameter or any possible way for me to 
know the way that solr calculates the scores it returns. For example, if
 we had a search relevancy formula like 
QF=attributes_name^15+attributes_brand^10+attributes_category^8, how can
 I know that brand scored 'x', for
 name 'y' and category 'z'. Is that possible? How can I do that?





This is urgent, if someone could take the time and answer this 
topic to me in a quick manner, I would really appreciate it.





Thank you very much for the attention, best regards,


Sérgio Alves  


Apparent odd interaction between autoCommit values and indexing ram buffer

2013-06-19 Thread Shawn Heisey

I've run into something a little odd that's been happening for a while.

The apparent symptoms: Two index segments are created every time an 
autoCommit (hard, not soft) happens during a DIH full-import.


Here's the directory listing from the first few minutes of importing, 
and a related INFOSTREAM:


http://apaste.info/22ue
https://dl.dropboxusercontent.com/u/97770508/INFOSTREAM-s1build.txt

The INFOSTREAM file has cruft from before, so if you search for 3g8 in 
the file, you'll be at the beginning of the relevant section.


I brought this up without resolution on the dev list last December. 
After some discussion in #solr-dev yesterday and some poking around with 
branch_4x, I think I might have figured out (at a high level) what's 
going on.


My 'ramBufferSizeMB' value is 48, and my autoCommit maxDocs is 25000. 
My documents probably tend to be 1-2kb, with some increasing a little 
beyond that.


Looking at the numDocs for each segment, here's what I think is happening:

The autoCommit kicks in after the first 25000 docs (25002 to be 
precise), but the ram buffer isn't emptied. The next 3339 documents get 
indexed, at which point the ram buffer fills up, so it flushes another 
segment.  Then it does another 21674 docs to approximately reach 25000 
for autoCommit, which forces another segment flush, but without emptying 
the buffer.  lather, rinse, repeat.


Each pair of numDocs values after the initial 25002 does add up to 
approximately 25000.


If I'm right about what's happening here, then here's the big question: 
Should the ram buffer be emptied when autoCommit triggers?  I think that 
it should, but can it be done without drastically affecting performance? 
 I haven't looked at the code ... I expect that it'll take me forever 
to understand it well enough to figure out if I'm right or wrong.


Update by query?

2013-06-19 Thread Timothy Potter
Quick check to see if Solr supports an update-by-query feature or if
anyone has thought about something like this ... similar to
delete-by-query

My specific use case is a metadata field that needs to be updated for N
docs, where N > 1 and the set can easily be identified by a query.
Currently, I have to pull them all back and update, which works but is
concerning when N is very large.

I checked JIRA and didn't see mention of this but might have missed it.
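
Until something like that exists, one way to keep the pull-back-and-update round trip small is to send only atomic "set" updates for the matching ids rather than whole documents (this still requires stored fields and the update log, as atomic updates always do). A rough SolrJ sketch with an illustrative URL, query and field names, and no paging or error handling:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class PseudoUpdateByQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("status:pending");    // the selecting query (illustrative)
        q.setFields("id");                                // only the ids are needed
        q.setRows(10000);                                 // a real version would page through results

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (SolrDocument match : server.query(q).getResults()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", match.getFirstValue("id"));
            Map<String, Object> op = new HashMap<String, Object>();
            op.put("set", "processed");                   // atomic update: overwrite the field value
            doc.addField("status", op);
            batch.add(doc);
        }
        if (!batch.isEmpty()) {
            server.add(batch);
            server.commit();
        }
        server.shutdown();
    }
}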

Cheers,
Tim


SOLR : ArrayIndexOutOfBoundsException from SolrDispatchFilter

2013-06-19 Thread Rohit Kumar
Need help to figure out the error below.


*Code Snippet*:

public class ConnectionComponent extends SearchComponent {

    @Override
    public void process(ResponseBuilder rb) throws IOException {

        NamedList nList = new SimpleOrderedMap();
        NamedList nl = new SimpleOrderedMap();

        List<Document> ld = new ArrayList<Document>();
        Document mydoc = new Document();
        mydoc.add(f); // IndexableField f not null
        ld.add(mydoc);

        nl.add("someKey", ld);
        nList.add("otherKey", nl);

        // rb instance of ResponseBuilder
        rb.rsp.add("returnKey", nList);
    }
}


ERROR org.apache.solr.servlet.SolrDispatchFilter  ?
null:java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.get(ArrayList.java:324)
at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
at 
org.apache.solr.response.transform.ValueSourceAugmenter.transform(ValueSourceAugmenter.java:92)
at 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:165)
at org.apache.solr.response.JSONWriter.writeArray(JSONResponseWriter.java:526)
at 
org.apache.solr.response.TextResponseWriter.writeArray(TextResponseWriter.java:289)
at 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:192)
at 
org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
at 
org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
at 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)


Re: Apparent odd interaction between autoCommit values and indexing ram buffer

2013-06-19 Thread Shawn Heisey

On 6/19/2013 10:38 AM, Shawn Heisey wrote:

Looking at the numDocs for each segment, here's what I think is happening:

The autoCommit kicks in after the first 25000 docs (25002 to be
precise), but the ram buffer isn't emptied. The next 3339 documents get
indexed, at which point the ram buffer fills up, so it flushes another
segment.  Then it does another 21674 docs to approximately reach 25000
for autoCommit, which forces another segment flush, but without emptying
the buffer.  lather, rinse, repeat.


I seem to be wrong about it being strictly related to ramBufferSizeMB. 
Today I bumped the buffer up to 256MB, restarted Solr, and started 
another full-import.


If I were completely right about the buffer interaction, this should 
have resulted in a few somewhat equal sized segments being created 
before creating a small one.  It didn't change anything - it's still two 
segments per autocommit, one of which is around 3000 docs and the other 
adds to that to make about 25000.


There's still something weird going on, but now I know that I don't 
completely understand it.  I hope someone can shed some light.


Thanks,
Shawn



Re: Update by query?

2013-06-19 Thread Jack Krupansky
It has come up before as a nice feature to have, but isn't in Solr right 
now.


I'd say go ahead and file a Jira for a new feature.

-- Jack Krupansky

-Original Message- 
From: Timothy Potter

Sent: Wednesday, June 19, 2013 12:57 PM
To: solr-user@lucene.apache.org
Subject: Update by query?

Quick check to see if Solr supports an update-by-query feature or if
anyone has thought about something like this ... similar to
delete-by-query

My specific use case is a metadata field needs to be updated for N
docs where N > 1 and the set can easily be identified by a query.
Currently, I have to pull them all back and update, which works but is
concerning when N is very large.

I checked JIRA and didn't see mention of this but might have missed it.

Cheers,
Tim 



Wildcards and Phrase queries

2013-06-19 Thread Isaac Hebsh
Hi,

I'm trying to understand what is the status of enabling wildcards on phrase
queries?

Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486
Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604

It looks like these issues are not going to be solved in the near future
:( Will they? Have they come to a (partial) dead end with the current
approach? Can I contribute anything to get them fixed in an official
version?

Are the latest patches attached to the JIRAs production ready?

[Should this message be sent to java-user list?]


RE: yet another optimize question

2013-06-19 Thread Petersen, Robert
Hi Walter,

I used to have larger settings on our caches but it seemed like I had to make 
the caches that small to reduce memory usage to keep from getting the dreaded 
OOM exceptions.  Also our search is behind Akamai with a one hour TTL.  Our 
slave farm has a load balancer in front of twelve slave servers and our index 
is being updated constantly, pretty much 24/7.  

So my question would be how do you run with such big caches without going into 
the OOM zone?  Was the Netflix index only updated based upon the release 
schedules of the studios, like once a week?  Our entertainment stores used to 
be like that before we turned into a marketplace based e-tailer, but now we get 
new listings from merchants all the time and so have a constant churn of 
additions and deletions in our index.

I feel like at 32GB our heap is really huge, but we seem to use almost all of 
it with these settings.   I am trying out the G1GC on one slave to see if that 
gets memory usage lower but while it has a different collection pattern in the 
various spaces it seems like the total memory usage peaks out at about the same 
level.

Thanks
Robi

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, June 18, 2013 6:57 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Your query cache is far too small. Most of the default caches are too small.

We run with 10K entries and get a hit rate around 0.30 across four servers. 
This rate goes up with more queries, down with less, but try a bigger cache, 
especially if you are updating the index infrequently, like once per day.

At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP cache 
in front of it. The HTTP cache had an 80% hit rate.

I'd increase your document cache, too. I usually see about 0.75 or better on 
that.

wunder

On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:

 Hi Otis, 
 
 Yes the query results cache is just about worthless.   I guess we have too 
 diverse of a set of user queries.  The business unit has decided to let bots 
 crawl our search pages too so that doesn't help either.  I turned it way down 
 but decided to keep it because my understanding was that it would still help 
 for users going from page 1 to page 2 in a search.  Is that true?
 
 Thanks
 Robi
 
 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
 Sent: Monday, June 17, 2013 6:39 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question
 
 Hi Robi,
 
 This goes against the original problem of getting OOMEs, but it looks like 
 each of your Solr caches could be a little bigger if you want to eliminate 
 evictions, with the query results one possibly not being worth keeping if you 
 can't get the hit % up enough.
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 
 
 On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi Otis,
 
 Right I didn't restart the JVMs except on the one slave where I was 
 experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
 made all our caches small enough to keep us from getting OOMs while still 
 having a good hit rate.Our index has about 50 fields which are mostly 
 int IDs and there are some dynamic fields also.  These dynamic fields can be 
 used for custom faceting.  We have some standard facets we always facet on 
 and other dynamic facets which are only used if the query is filtering on a 
 particular category.  There are hundreds of these fields but since they are 
 only for a small subset of the overall index they are very sparsely 
 populated with regard to the overall index.  With CMS GC we get a sawtooth 
 on the old generation (I guess every replication and commit causes it's 
 usage to drop down to 10GB or so) and it seems to be the old generation 
 which is the main space consumer.  With the G1GC, the memory map looked 
 totally different!  I was a little lost looking at memory consumption with 
 that GC.  Maybe I'll try it again now that the index is a bit smaller than 
 it was last time I tried it.  After four days without running an optimize 
 now it is 21GB.  BTW our indexing speed is mostly bound by the DB so 
 reducing the segments might be ok...
 
 Here is a quick snapshot of one slave's memory map as reported by PSI-Probe, 
 but unfortunately I guess I can't send the history graphics to the solr-user 
 list to show their changes over time:

   Name                 Used       Committed   Max         Initial     Group
   Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
   CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
   Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
   CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB

RE: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Petersen, Robert
OK thanks, will do.  Just out of curiosity, what would having that set way too 
high do?  Would the index become fragmented or what?

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, June 19, 2013 9:33 AM
To: solr-user@lucene.apache.org
Subject: Re: TieredMergePolicy reclaimDeletesWeight

The default is 2.0, and higher values will more strongly favor merging segments 
with deletes.

I think 20.0 is likely way too high ... maybe try 3-5?


Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:
 Hi

 In continuing a previous conversation, I am attempting to not have to 
 do optimizes on our continuously updated index in solr3.6.1 and I came 
 across the mention of the reclaimDeletesWeight setting in this blog: 
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-mer
 ges.html

 We do a *lot* of deletes in our index so I want to make the merges be more 
 aggressive on reclaiming deletes, but I am having trouble finding much out 
 about this setting.  Does anyone have experience with this setting?  Would 
 the below accomplish what I want ie for it to go after deletes more 
 aggressively than normal?  I got the impression 10.0 was the default from 
 looking at this code but I could be wrong:
 https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulB
 uild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3
 085

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">20</int>
   <int name="segmentsPerTier">8</int>
   <double name="reclaimDeletesWeight">20.0</double>
 </mergePolicy>

 Thanks

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department





Sharding and Replication clarification

2013-06-19 Thread Asif
Hi,

I had questions on implementation of Sharding and Replication features of
Solr/Cloud.

1. I noticed that when sharding is enabled for a collection - individual
requests are sent to each node serving as a shard.

2. Replication too follows the above strategy of sending individual documents
to the nodes serving as replicas.

I am working with a system that requires a massive number of writes - I have
noticed that, due to the above, the cloud eventually starts to fail
(even though I am using an ensemble).

I do understand the reason behind individual updates - but why not batch them
up, or give an option to batch N updates in either of the above cases? I did
come across a presentation that talked about batching 10 updates for
replication at least, but I do not think this is the case.
- Asif


Re: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Michael McCandless
Way too high would cause it to pick highly lopsided merges just
because a few deletes were removed.

Highly lopsided merges (e.g. one big segment and N tiny segments) can
be horrible because it can lead to O(N^2) merge cost over time.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 19, 2013 at 1:36 PM, Petersen, Robert
robert.peter...@mail.rakuten.com wrote:
 OK thanks, will do.  Just out of curiosity, what would having that set way 
 too high do?  Would the index become fragmented or what?

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Wednesday, June 19, 2013 9:33 AM
 To: solr-user@lucene.apache.org
 Subject: Re: TieredMergePolicy reclaimDeletesWeight

 The default is 2.0, and higher values will more strongly favor merging 
 segments with deletes.

 I think 20.0 is likely way too high ... maybe try 3-5?


 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi

 In continuing a previous conversation, I am attempting to not have to
 do optimizes on our continuously updated index in solr3.6.1 and I came
 across the mention of the reclaimDeletesWeight setting in this blog:
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-mer
 ges.html

 We do a *lot* of deletes in our index so I want to make the merges be more 
 aggressive on reclaiming deletes, but I am having trouble finding much out 
 about this setting.  Does anyone have experience with this setting?  Would 
 the below accomplish what I want ie for it to go after deletes more 
 aggressively than normal?  I got the impression 10.0 was the default from 
 looking at this code but I could be wrong:
 https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulB
 uild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3
 085

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">20</int>
    <int name="segmentsPerTier">8</int>
    <double name="reclaimDeletesWeight">20.0</double>
  </mergePolicy>

 Thanks

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department





RE: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Petersen, Robert
Oh!  Thanks for the info.  I'll change that right away.

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, June 19, 2013 10:42 AM
To: solr-user@lucene.apache.org
Subject: Re: TieredMergePolicy reclaimDeletesWeight

Way too high would cause it to pick highly lopsided merges just because a few 
deletes were removed.

Highly lopsided merges (e.g. one big segment and N tiny segments) can be 
horrible because it can lead to O(N^2) merge cost over time.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 19, 2013 at 1:36 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:
 OK thanks, will do.  Just out of curiosity, what would having that set way 
 too high do?  Would the index become fragmented or what?

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Wednesday, June 19, 2013 9:33 AM
 To: solr-user@lucene.apache.org
 Subject: Re: TieredMergePolicy reclaimDeletesWeight

 The default is 2.0, and higher values will more strongly favor merging 
 segments with deletes.

 I think 20.0 is likely way too high ... maybe try 3-5?


 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi

 In continuing a previous conversation, I am attempting to not have to 
 do optimizes on our continuously updated index in solr3.6.1 and I 
 came across the mention of the reclaimDeletesWeight setting in this blog:
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-me
 r
 ges.html

 We do a *lot* of deletes in our index so I want to make the merges be more 
 aggressive on reclaiming deletes, but I am having trouble finding much out 
 about this setting.  Does anyone have experience with this setting?  Would 
 the below accomplish what I want ie for it to go after deletes more 
 aggressively than normal?  I got the impression 10.0 was the default from 
 looking at this code but I could be wrong:
 https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessful
 B
 uild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=
 3
 085

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">20</int>
    <int name="segmentsPerTier">8</int>
    <double name="reclaimDeletesWeight">20.0</double>
  </mergePolicy>

 Thanks

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department







Re: yet another optimize question

2013-06-19 Thread Walter Underwood
I generally run with an 8GB heap for a system that does no faceting. 32GB does 
seem rather large, but you really should have room for bigger caches.

The Akamai cache will reduce your hit rate a lot. That is OK, because users are 
getting faster responses than they would from Solr. A 5% hit rate may be OK 
since you have that front end HTTP cache.

The Netflix index was updated daily. 

wunder

On Jun 19, 2013, at 10:36 AM, Petersen, Robert wrote:

 Hi Walter,
 
 I used to have larger settings on our caches but it seemed like I had to make 
 the caches that small to reduce memory usage to keep from getting the dreaded 
 OOM exceptions.  Also our search is behind Akamai with a one hour TTL.  Our 
 slave farm has a load balancer in front of twelve slave servers and our index 
 is being updated constantly, pretty much 24/7.  
 
 So my question would be how do you run with such big caches without going 
 into the OOM zone?  Was the Netflix index only updated based upon the release 
 schedules of the studios, like once a week?  Our entertainment stores used to 
 be like that before we turned into a marketplace based e-tailer, but now we 
 get new listings from merchants all the time and so have a constant churn of 
 additions and deletions in our index.
 
 I feel like at 32GB our heap is really huge, but we seem to use almost all of 
 it with these settings.   I am trying out the G1GC on one slave to see if 
 that gets memory usage lower but while it has a different collection pattern 
 in the various spaces it seems like the total memory usage peaks out at about 
 the same level.
 
 Thanks
 Robi
 
 -Original Message-
 From: Walter Underwood [mailto:wun...@wunderwood.org] 
 Sent: Tuesday, June 18, 2013 6:57 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question
 
 Your query cache is far too small. Most of the default caches are too small.
 
 We run with 10K entries and get a hit rate around 0.30 across four servers. 
 This rate goes up with more queries, down with less, but try a bigger cache, 
 especially if you are updating the index infrequently, like once per day.
 
 At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP 
 cache in front of it. The HTTP cache had an 80% hit rate.
 
 I'd increase your document cache, too. I usually see about 0.75 or better on 
 that.
 
 wunder
 
 On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:
 
 Hi Otis, 
 
 Yes the query results cache is just about worthless.   I guess we have too 
 diverse of a set of user queries.  The business unit has decided to let bots 
 crawl our search pages too so that doesn't help either.  I turned it way 
 down but decided to keep it because my understanding was that it would still 
 help for users going from page 1 to page 2 in a search.  Is that true?
 
 Thanks
 Robi
 
 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
 Sent: Monday, June 17, 2013 6:39 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question
 
 Hi Robi,
 
 This goes against the original problem of getting OOMEs, but it looks like 
 each of your Solr caches could be a little bigger if you want to eliminate 
 evictions, with the query results one possibly not being worth keeping if 
 you can't get the hit % up enough.
 
 Otis
 --
  Solr & ElasticSearch Support -- http://sematext.com/
 
 
 On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi Otis,
 
 Right I didn't restart the JVMs except on the one slave where I was 
 experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
 made all our caches small enough to keep us from getting OOMs while still 
 having a good hit rate.Our index has about 50 fields which are mostly 
 int IDs and there are some dynamic fields also.  These dynamic fields can 
 be used for custom faceting.  We have some standard facets we always facet 
 on and other dynamic facets which are only used if the query is filtering 
 on a particular category.  There are hundreds of these fields but since 
 they are only for a small subset of the overall index they are very 
 sparsely populated with regard to the overall index.  With CMS GC we get a 
 sawtooth on the old generation (I guess every replication and commit causes 
 it's usage to drop down to 10GB or so) and it seems to be the old 
 generation which is the main space consumer.  With the G1GC, the memory map 
 looked totally different!  I was a little lost looking at memory 
 consumption with that GC.  Maybe I'll try it again now that the index is a 
 bit smaller than it was last time I tried it.  After four days without 
 running an optimize now it is 21GB.  BTW our indexing speed is mostly bound 
 by the DB so reducing the segments might be ok...
 
 Here is a quick snapshot of one slaves memory map as reported by PSI-Probe, 
 but unfortunately I guess I can't send the history graphics to the 
 solr-user list to show their 

RE: yet another optimize question

2013-06-19 Thread Petersen, Robert
We actually have hundreds of facet-able fields, but most are specialized and 
are only faceted upon if the user has drilled into the particular category to 
which they are applicable and so they are only indexed for products in those 
categories.  I guess it is the facets that eat up so much of our memory.  It 
was suggested that if I use facet method = enum for those particular 
specialized facets then my memory usage would go down.  I'm going to try that 
out and see how much it helps.
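For reference, facet.method can be set per field on the request with the 
f.<fieldname>. prefix, so only the specialized facets would switch to enum - 
a sketch, with the field name made up for illustration:

  facet=true
  &facet.field=merchant_attr_color
  &f.merchant_attr_color.facet.method=enum

(facet.enum.cache.minDf can additionally keep enum faceting from pushing 
rare terms into the filterCache.)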

Thanks
Robi

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, June 19, 2013 10:50 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

I generally run with an 8GB heap for a system that does no faceting. 32GB does 
seem rather large, but you really should have room for bigger caches.

The Akamai cache will reduce your hit rate a lot. That is OK, because users are 
getting faster responses than they would from Solr. A 5% hit rate may be OK 
since you have that front end HTTP cache.

The Netflix index was updated daily. 

wunder

On Jun 19, 2013, at 10:36 AM, Petersen, Robert wrote:

 Hi Walter,
 
 I used to have larger settings on our caches but it seemed like I had to make 
 the caches that small to reduce memory usage to keep from getting the dreaded 
 OOM exceptions.  Also our search is behind Akamai with a one hour TTL.  Our 
 slave farm has a load balancer in front of twelve slave servers and our index 
 is being updated constantly, pretty much 24/7.  
 
 So my question would be how do you run with such big caches without going 
 into the OOM zone?  Was the Netflix index only updated based upon the release 
 schedules of the studios, like once a week?  Our entertainment stores used to 
 be like that before we turned into a marketplace based e-tailer, but now we 
 get new listings from merchants all the time and so have a constant churn of 
 additions and deletions in our index.
 
 I feel like at 32GB our heap is really huge, but we seem to use almost all of 
 it with these settings.   I am trying out the G1GC on one slave to see if 
 that gets memory usage lower but while it has a different collection pattern 
 in the various spaces it seems like the total memory usage peaks out at about 
 the same level.
 
 Thanks
 Robi
 
 -Original Message-
 From: Walter Underwood [mailto:wun...@wunderwood.org] 
 Sent: Tuesday, June 18, 2013 6:57 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question
 
 Your query cache is far too small. Most of the default caches are too small.
 
 We run with 10K entries and get a hit rate around 0.30 across four servers. 
 This rate goes up with more queries, down with less, but try a bigger cache, 
 especially if you are updating the index infrequently, like once per day.
 
 At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP 
 cache in front of it. The HTTP cache had an 80% hit rate.
 
 I'd increase your document cache, too. I usually see about 0.75 or better on 
 that.
 
 wunder
 
 On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:
 
 Hi Otis, 
 
 Yes the query results cache is just about worthless.   I guess we have too 
 diverse of a set of user queries.  The business unit has decided to let bots 
 crawl our search pages too so that doesn't help either.  I turned it way 
 down but decided to keep it because my understanding was that it would still 
 help for users going from page 1 to page 2 in a search.  Is that true?
 
 Thanks
 Robi
 
 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
 Sent: Monday, June 17, 2013 6:39 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question
 
 Hi Robi,
 
 This goes against the original problem of getting OOMEs, but it looks like 
 each of your Solr caches could be a little bigger if you want to eliminate 
 evictions, with the query results one possibly not being worth keeping if 
 you can't get the hit % up enough.
 
 Otis
 --
  Solr & ElasticSearch Support -- http://sematext.com/
 
 
 On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi Otis,
 
 Right I didn't restart the JVMs except on the one slave where I was 
 experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
 made all our caches small enough to keep us from getting OOMs while still 
 having a good hit rate.Our index has about 50 fields which are mostly 
 int IDs and there are some dynamic fields also.  These dynamic fields can 
 be used for custom faceting.  We have some standard facets we always facet 
 on and other dynamic facets which are only used if the query is filtering 
 on a particular category.  There are hundreds of these fields but since 
 they are only for a small subset of the overall index they are very 
 sparsely populated with regard to the overall index.  With CMS GC we get a 
 sawtooth on the old generation (I guess every 

solr spatial search with distance to search results

2013-06-19 Thread PeterKerk
I was reading this: http://wiki.apache.org/solr/SpatialSearch

I have this Solr query:

http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq={!geofilt}&pt=51.4416420,5.4697225&sfield=geolocation&d=20&sort=geodist()%20asc&q=*:*&start=0&rows=10&fl=_dist_:geodist(),id,title,lat,lng,location&facet.mincount=1

And this in my schema.xml

<fieldType name="location" class="solr.LatLonType"
    subFieldSuffix="_coordinate"/>

<field name="geolocation" type="location" indexed="true" stored="true"/>

<dynamicField name="*_coordinate" type="tdouble" indexed="true"
    stored="false"/>


However, with my current query string, I don't see a distance field in the
document, and also no location field.

What am I missing?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-spatial-search-with-distance-to-search-results-tp4071745.html
Sent from the Solr - User mailing list archive at Nabble.com.


fq vs q parameter

2013-06-19 Thread Learner
Hi,

I am currently using the below configuration in one of my handlers and I was
thinking of removing the values from the q parameter and including them as
part of the fq parameter.

Can someone let me know if there is any performance improvement when using
the fq parameter compared to q?

  <str name="q">
    (
      _query_:"{!dismax qf=person_name_lname_i v=$fps_lname}"^8.3 OR
    )
  </str>
</lst>
<lst name="appends">
  <str name="fq">{!switch case='*:*' default=$fq_bbox v=$fps_latlong}</str>
</lst>
<lst name="invariants">
  <str name="fq_bbox">_query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"^0.2</str>
</lst>




--
View this message in context: 
http://lucene.472066.n3.nabble.com/fq-vs-q-parameter-tp4071748.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: fq vs q parameter

2013-06-19 Thread Michael Della Bitta
Yes, definitely, fq parameters don't affect scoring and can be cached.
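For example (field names partly made up), only the q clause contributes to
the score, while each fq is applied as an independently cached filter:

  q=person_name_lname_i:smith&fq=state:CA&fq=age:[30 TO 40]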

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Wed, Jun 19, 2013 at 4:27 PM, Learner bbar...@gmail.com wrote:

 Hi,

 I am currently using the below configuration in one of my handler and I was
 thinking of removing the values from q parameter and including as a part of
 fq parameter.

 Can someone let me know if there is any performance improvement when using
 fq parameter compared to q?

    <str name="q">
      (
        _query_:"{!dismax qf=person_name_lname_i v=$fps_lname}"^8.3 OR
      )
    </str>
  </lst>
  <lst name="appends">
    <str name="fq">{!switch case='*:*' default=$fq_bbox v=$fps_latlong}</str>
  </lst>
  <lst name="invariants">
    <str name="fq_bbox">_query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"^0.2</str>
  </lst>




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/fq-vs-q-parameter-tp4071748.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: fq vs q parameter

2013-06-19 Thread adityab
I see that your query has a boost value, so this means you need Solr to score
each matching document.

One of the key differences between q and fq is that fq will not have any
impact on the score, whereas having it in q will score each document based on
the similarity score.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/fq-vs-q-parameter-tp4071748p4071758.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: fq vs q parameter

2013-06-19 Thread adityab
+1 
q and fq both can be cached.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/fq-vs-q-parameter-tp4071748p4071759.html
Sent from the Solr - User mailing list archive at Nabble.com.


Informal poll on running Solr 4 on Java 7 with G1GC

2013-06-19 Thread Timothy Potter
I'm sure there's some site to do this but wanted to get a feel for
who's running Solr 4 on Java 7 with G1 gc enabled?

Cheers,
Tim


Re: Adding documents in Solr plugin

2013-06-19 Thread Otis Gospodnetic
I think this makes sense.  Timothy asked about update by query in
the last 24 hours and this sounds like the same thing.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, Jun 19, 2013 at 3:52 AM, Avner Levy av...@checkpoint.com wrote:
 I have a core with millions of records.
 I want to add a custom handler which scan the existing documents and update 
 one of the fields (delete and add document) based on a condition (age>12 for 
 example).
 All fields are stored so there is no problem to recreate the document from 
 the search result.
 I prefer doing it on the Solr server side for avoiding sending millions of 
 documents to the client and back.
 I'm thinking of writing a solr plugin which will receive a query and update 
 some fields on the query documents (like the delete by query handler).
 Are existing solutions or better alternatives?
 I couldn't find any examples of Solr plugins which update / add / delete 
 documents (I don't need to extend the update handler).
 If someone has an example it will be great help.
 Thanks in advance


Re: Merge tool based on mergefactor

2013-06-19 Thread Otis Gospodnetic
Hi,

On Wed, Jun 19, 2013 at 3:52 AM, Cosimo Streppone cos...@streppone.it wrote:
 On 06/19/2013 03:21 AM, Otis Gospodnetic wrote:

 You could call the optimize command directly on slaves, but specify
 the target number of segments, e.g.
  /solr/update?optimize=true&maxSegments=10

 Not sure I recommend doing this on slaves, but you could - maybe you
 have spare capacity.  You may also want to consider not doing it on
 all your slaves at the same time...

 IIUC this assumes your slaves do not replicate too often,
 otherwise replication would reset the index to whatever
 number of segments the master has.

 You could still perform an optimize with maxSegments after
 every replication, if it's acceptable in the situation
 you are in.

 However, if you need slaves to update every 2-5 minutes,
 that would be impractical and wasteful.

 Is this correct?

Correct.

 If so, how to find a fair compromise/balance between
 master and slave merge factors if you need very frequent indexing
 of new documents (say continuous) on the master and up-to-date
 indexes on the slaves (say 2-5' pollInterval)?

If you need that, you go to SolrCloud and start using soft commits.
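In solrconfig.xml that is the autoSoftCommit setting, e.g. (interval chosen
just for illustration):

  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>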

Otis
--
Solr & ElasticSearch Support
http://sematext.com/


update solr.xml dynamically to add new cores

2013-06-19 Thread smanad
Hi, 
Is there a way to edit solr.xml as a part of debian package installation to
add new cores?
In my use case, there are 4 solr indexes and they are managed/configured by
different teams.
The way I am thinking packages will work is as described below,
1. There will be a solr-base debian package which comes with a solr
installation with tomcat setup (I am planning to use solr 4.3)
2. There will be individual index debian packages like
solr-index1, solr-index2, which will be dependent on solr-base.
Each package's DEBIAN postinst script will have logic to edit solr.xml to
add a new index like index1, index2, etc. (a sketch of such an entry is shown
below).
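For illustration, a Solr 4.3 legacy-style solr.xml that such postinst scripts
would append core entries to looks roughly like this (core names are just
examples):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="index1" instanceDir="index1" />
      <core name="index2" instanceDir="index2" />
    </cores>
  </solr>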

Does this sound good? or is there a better/different way to do this?
Any pointers will be much appreciated.
Thanks, 
-M



--
View this message in context: 
http://lucene.472066.n3.nabble.com/update-solr-xml-dynamically-to-add-new-cores-tp4071800.html
Sent from the Solr - User mailing list archive at Nabble.com.


Partial update using solr 4.3 with csv input

2013-06-19 Thread smanad
I was going through this link 
http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/ and one of
the comments is about support for csv. 

Since the comment is almost a year old, I'm just wondering if it is still true
that partial updates are possible only with xml and json input?

Thanks, 
-M



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-update-using-solr-4-3-with-csv-input-tp4071801.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Partial update using solr 4.3 with csv input

2013-06-19 Thread Jack Krupansky
Correct, no atomic update for CSV format. There just isn't any place to put 
the atomic update options in such a simple text format.
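In XML or JSON the operation rides along with each field value, e.g. a JSON
atomic update looks roughly like this (field names made up for illustration):

  [ {"id":"doc1", "price":{"set":19.95}, "views":{"inc":1}} ]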


-- Jack Krupansky

-Original Message- 
From: smanad

Sent: Wednesday, June 19, 2013 8:30 PM
To: solr-user@lucene.apache.org
Subject: Partial update using solr 4.3 with csv input

I was going through this link
http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/ and one of
the comments is about support for csv.

Since the comment is almost a year old, just wondering if this is still true
that, partial updates are possible only with xml and json input?

Thanks,
-M



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-update-using-solr-4-3-with-csv-input-tp4071801.html
Sent from the Solr - User mailing list archive at Nabble.com. 



SolrCloud - Score calculation

2013-06-19 Thread Learner
Hi,

Sorry if its a very basic question but I am pretty new to SolrCloud and I am
trying to understand the underlying mechanism for calculating relevancy.

Currently we are using SOLR 3.6.X and we use shards to perform distributed
searching. Our shards are not of equal size hence sometimes the results are
not as we expected. 

For ex: Shard 1 has 30 million documents, Shard 2 has 30 million documents
and shard 3 has just 3 million documents (push indexing via message queue). 

When we do a search using shards, documents from shard 1 and shard 2 get
higher priority compared to documents in shard 3 (since it's smaller).
Currently we add an index-time boost when adding documents to shard 3 so that
the documents from shard 3 also come up (higher) in search results.

Now when using SolrCloud, say for example one shard has a person name
repeated 5 times (with different unique ids) and we have one more document
with the same person name in shard 2 (with a different id). When we do a
search, how does SOLR calculate the score? Does it do something like constant
scoring across the shards in order to bring up the search results from all of
them? How does the score get calculated? Do all 6 documents get the same
score (5 from shard 1 and 1 from shard 2, if all the fields have the same
value except for the unique id)?

Thanks,
BB 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Score-calculation-tp4071805.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud - Score calculation

2013-06-19 Thread Upayavira
The reason for the issue you are seeing is the IDF component in the
score. IDF = inverse document frequency.

The document frequency is the number of documents in which a term appears
in the index. The higher the document frequency, the more common the term and
thus the less relevant it is. The document frequency is inverted to give
a higher number for more relevant terms.
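(Concretely, Lucene's default similarity computes roughly
idf(t) = 1 + ln(numDocs / (docFreq + 1)), so the same document frequency
yields a noticeably different idf on a 3m-doc shard than on a 30m-doc shard.)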

Solr does not yet support distributed IDF. Therefore the document
frequency in a 3m shard will be higher (as a proportion of your index)
compared to your 30m shard, and thus it will score lower.

I am not aware of a multiplier you can use to fix this. There is a
distributed IDF ticket in JIRA, maybe that is mature enough and might
help you.

Upayavira

On Thu, Jun 20, 2013, at 01:56 AM, Learner wrote:
 Hi,
 
 Sorry if its a very basic question but I am pretty new to SolrCloud and I
 am
 trying to understand the underlying mechanism for calculating relevancy.
 
 Currently we are using SOLR 3.6.X and we use shards to perform
 distributed
 searching. Our shards are not of equal size hence sometimes the results
 are
 not as we expected. 
 
 For ex: Shard 1 has 30 million documents, Shard 2 has 30 millon documents
 and shard 3 has just 3 million documents (push indexing via message
 queue). 
 
 When we do a search using shards, documents from shard 1 and shard 2 gets
 higher priority compared to documents in shard 3 (since its smaller).
 Currently we add index time boost when adding documents to shard 3 so
 that
 the documents from shard 3 also comes up (higher) in search results.
 
 Now when using SolrCloud, say for example if one shard has person name
 repeated 5 times (with different unique id)  and we have one more same
 person name in shard 2 (with diff id), and when we do a search how does
 SOLR
 calculate the score? Does it do something like constant scoring across
 various shards in order to bring up the search results across various
 shards? How does the score gets calculated.. Does the score of all 6
 documents have same value(5 from shard 1 and 1 from shard 2 -if all the
 fields have same value except for unique id)? 
 
 Thanks,
 BB 
 
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SolrCloud-Score-calculation-tp4071805.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Informal poll on running Solr 4 on Java 7 with G1GC

2013-06-19 Thread Shawn Heisey
On 6/19/2013 4:18 PM, Timothy Potter wrote:
 I'm sure there's some site to do this but wanted to get a feel for
 who's running Solr 4 on Java 7 with G1 gc enabled?

I have tried it, but found that G1 didn't give me any better GC pause
characteristics than CMS without tuning, and may have actually been
worse.  Now I use CMS with several tuning options.

Thanks,
Shawn



Re: PostingsSolrHighlighter not working on Multivalue field

2013-06-19 Thread Floyd Wu
Hi Erick,

multivalue was my typo, thanks for the reminder.

There is no log show anything wrong or exception occurred.

The field definition as following

<field name="summary" type="text" indexed="true" stored="true"
    omitNorms="false" termVectors="true" termPositions="true"
    termOffsets="true" storeOffsetsWithPositions="true"/>

<dynamicField name="*" type="text" indexed="true" stored="true"
    multiValued="true" termVectors="true" termPositions="true"
    termOffsets="true" omitNorms="false" storeOffsetsWithPositions="true"/>

The PostingsSolrHighlighter only highlights the summary field.

When I send an xml file to solr like this

<?xml version="1.0" encoding="utf-8"?>
<command>
  <add>
    <doc>
      <field name="summary">facebook yahoo plurk twitter social nextworing</field>
      <field name="body_0">facebook yahoo plurk twitter social nextworing</field>
    </doc>
  </add>
</command>

As you can see the body_0 will be treated using dynamicField definition.

Part of the debug response returned by Solr looks like this

<lst name="highlighting">
  <lst name="645">
    <arr name="summary">
      <str><em>Facebook</em>... <em>Facebook</em></str>
    </arr>
    <arr name="body_0"/>

</lst>

I'm sure hl.fl contains both summary and body_0.
This behavior is different between PostingsSolrHighlighter and
FastVectorHighlighter.

Please kindly help on this.
Many thanks.

Floyd



2013/6/19 Erick Erickson erickerick...@gmail.com

  Well, _how_ does it fail? Unless it's a typo, it should be
  multiValued (note capital 'V'). This probably isn't the
  problem, but just in case.

 Anything in the logs? What is the field definition?
 Did you re-index after changing to multiValued?

 Best
 Erick

 On Tue, Jun 18, 2013 at 11:01 PM, Floyd Wu floyd...@gmail.com wrote:
  In my test case, it seems this new highlighter not working.
 
  When field set multivalue=true, the stored text in this field can not be
  highlighted.
 
  Am I miss something? Or this is current limitation? I have no luck to
 find
  any documentations mentioned this.
 
  Floyd



RE: Solr 4.2 in SolrCloud mode lost response for update but search is normal

2013-06-19 Thread Kevin Xiang
From the core dump information, it seems that the issue is the same as this jira:
https://issues.apache.org/jira/browse/SOLR-4400: Rapidly opening and closing
cores can lead to deadlock
Mark Miller:
Has this issue come up again?

Thanks.
From: Qun Wang
Sent: June 20, 2013 11:24
To: solr-user@lucene.apache.org (solr-user@lucene.apache.org)
Subject: Solr 4.2 in SolrCloud mode lost response for update but search is 
normal

Hi, all:

I’m using SolrCloud with Solr 4.2, and currently a strange issue often happens
and confuses me.
The running env has three zookeepers and two solrs, using the same shard and
six cores.
Without any network or resource issues, I found that after a while solr stops
responding to updates but queries are normal.
The servers neither process any updates nor generate response info; it seems
something has entered a deadlock.
Thread dump info for one machine is attached.

Could someone help me check what’s the root issue?

Thanks!
__
Qun Wang
Application Service - Backend
Morningstar (Shenzhen) Ltd.
Morningstar. Illuminating investing worldwide.
+86 755 3311 0218 Office
+86 186 6538 3975 Mobile
qun.w...@morningstar.com
This e-mail contains privileged and confidential information and is intended 
only for the use of the person(s) named above. Any dissemination, distribution 
or duplication of this communication without prior written consent from 
Morningstar is strictly prohibited. If you received this message in error 
please contact the sender immediately and delete the materials from any 
computer.