Request to be added to the ContributorsGroup
Hi, My username is KumarLImbu and I would like to be added to the Contributors Group. Could somebody please help me? Best Regards, Kumar
Re: Usage of CloudSolrServer?
CloudSolrServer uses LBHttpSolrServer under the hood. CloudSolrServer connects to Zookeeper and passes the list of live nodes to LBHttpSolrServer, which then contacts the nodes in round-robin order. By the way, do you mean leader instead of master? 2013/7/12 sathish_ix skandhasw...@inautix.co.in Hi, I am using CloudSolrServer to connect to SolrCloud, and I am indexing documents through the SolrJ API using a CloudSolrServer object. Indexing is triggered on the master node of a collection, but when I need to find the status of the loading, the message is returned from a replica where the status is null. How can I find which instance CloudSolrServer is connecting to? -- View this message in context: http://lucene.472066.n3.nabble.com/Usage-of-CloudSolrServer-tp4056052p4077471.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Leader Election, when?
If you plan to have 2 shards: when you start the first node it becomes the leader of the first shard, and when you start the second node it becomes the leader of the second shard. The third node becomes a replica of the first shard, the fourth a replica of the second shard, the fifth a replica of the first shard... and this continues like that, round robin. 2013/7/11 aabreur alexandre.ab...@vtex.com.br I have a working Zookeeper ensemble running with 3 instances and also a solrcloud cluster with some solr instances. I've created a collection with settings for 2 shards. Then I: create 1 core on instance1, create 1 core on instance2, create 1 core on instance1, create 1 core on instance2, just to have this configuration: instance1: shard1_leader, shard2_replica; instance2: shard1_replica, shard2_leader. If I add 2 cores to instance1 and then 2 cores to instance2, both leaders will be on instance1 and no re-election is done: instance1: shard1_leader, shard2_leader; instance2: shard1_replica, shard2_replica. Back to my ideal scenario (detached leaders): when I add a third instance with 2 replicas and kill one of my instances running a leader, the election picks the instance that already has a leader. My question is why Zookeeper behaves this way. Shouldn't it distribute leaders? If I deliver some stress to a double-leader instance, is Zookeeper going to run an election? -- View this message in context: http://lucene.472066.n3.nabble.com/Leader-Election-when-tp4077381.html Sent from the Solr - User mailing list archive at Nabble.com.
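The round-robin placement described in the reply above can be sketched in plain Java (a hypothetical helper for illustration, not Solr's actual assignment code, which lives in the Overseer):

```java
public class RoundRobinAssignment {
    // The n-th node to start (0-based) lands on shard (n % numShards) + 1;
    // the first node to arrive on each shard becomes its leader, and later
    // arrivals become replicas.
    static String assign(int nodeIndex, int numShards) {
        int shard = (nodeIndex % numShards) + 1;
        String role = nodeIndex < numShards ? "leader" : "replica";
        return "shard" + shard + "_" + role;
    }

    public static void main(String[] args) {
        // Replays the five-node startup sequence from the reply above.
        for (int n = 0; n < 5; n++) {
            System.out.println("node" + (n + 1) + " -> " + assign(n, 2));
        }
    }
}
```

With 2 shards this prints shard1_leader, shard2_leader, shard1_replica, shard2_replica, shard1_replica for nodes 1 through 5, matching the sequence in the reply.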
Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper
If you have one collection you just need to define the hostnames of the Zookeeper ensemble and run that command once. 2013/7/11 Zhang, Lisheng lisheng.zh...@broadvision.com Hi, We are testing solr 4.3.0 in Tomcat (considering upgrading solr 3.6.1 to 4.3.0). In the WIKI page for SolrCloud in Tomcat: http://wiki.apache.org/solr/SolrCloudTomcat we need to link each collection explicitly: /// 8) Link uploaded config with target collection java -classpath .:/home/myuser/solr-war-lib/* org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection mycollection -confname ... /// But our application has many cores (a few thousand, which all share the same schema/config); is there a more convenient way? Thanks very much for helps, Lisheng
Re: Performance of cross join vs block join
Hi Mikhail, I used the term block join incorrectly: when I said block join I was referring to a join performed on a single core, versus a cross join performed across multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? How would I need to index the data from my tables? In my case all the indices share a common schema anyway, since I am using dynamic fields, so I can easily add the documents from all tables into one Solr core, adding a discriminator field to each document. Could you point me to some more documentation? Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, To me it's reasonable that a single-core join takes the same time as a cross-core one; I can't see what gain could be had in the former case. I can hardly comment on the join code; I've looked into it, and it's not trivial, to say the least. With block join there is no need to obtain parentId term values/numbers and look up parents by them; both of those actions are expensive. Block join also works as an iterator, whereas join needs to allocate memory for the parents bitset and populate it out of order, which hurts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, only hit the first one. Another nice feature is 'both-side leapfrog': if you have a highly restrictive filter/query intersecting with BJQ, it can skip many parents and children as well, which is not possible in join, which has a fairly 'full-scan' nature. The main performance factor for join is the number of child docs.
I'm not sure I got all your questions; please specify them in more detail if something is still unclear. Have you seen my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know of any performance measurements for cross joins compared to joins inside a single index? Is the join inside a single index that stores documents of all types (from the parent table and the child tables) with a discriminator field faster than the cross join (where each document type resides in its own index)? I have performed some tests, but it seems to me that a join in a single (bigger) index does not add much speed compared to cross joins. Why would a block join be faster than a cross join, if that is the case? What are the variables that count when trying to improve query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Solr 4.3 Shard distributed request check probably incorrect?
Hi, we are using Solr 4.3 with regular sharding without ZooKeeper. I see the following errors inside our logs: 14995742 [qtp427093680-2249] INFO org.apache.solr.core.SolrCore - [DE1] webapp=/solr path=/select params={mm=266%25tie=0.1ids=1060691781qf=Title^1.2+Description^0.01+Keywords^0.4+ArtikelNumber^0.1distrib=falseq.alt=*:*wt=javabinversion=2rows=10defType=edismaxpf=%0aTitle^1.5+Description^0.3%0a+NOW=1373459092416shard.url= 172.31.4.63:8080/solr/DE1fl=%0aPID,updated,score%0a+start=0q=9783426647240bf=%0a%0a+partialResults=truetimeAllowed=5000isShard=truefq=Price:[*+TO+9]fq=ShopId1+8+10+12+2975)ps=100} status=0 QTime=2 14995742 [qtp427093680-2255] ERROR org.apache.solr.servlet.SolrDispatchFilter - null:java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.createMainQuery(QueryComponent.java:727) at org.apache.solr.handler.component.QueryComponent.regularDistributedProcess(QueryComponent.java:588) at org.apache.solr.handler.component.QueryComponent.distributedProcess(QueryComponent.java:541) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:244) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) Because this is obviously a distributed request isShard=truedistrib=false shouldn't this request evaluate to a non-distributed request? It seems that the ResponseBuilder is marked as isDistrib == true, because otherwise it wouldn't execute the distributedProcess or am I wrong? Best regards, Hans
About Suggestions
Hi Solr people! We need to suggest part numbers in alphabetically order adding up to four characters to the already entered part number prefix. That works quite well with terms component acting on a multivalued field with keyword tokenizer and edge nGram filter. I am mentioning part numbers to indicate that each item in the multivalued field is a string without whitespace and where special characters like dashes cannot be seen as separators. Is there a way to know if the term (the suggestion) represents such a complete part number (without doing another query for each suggestion)? Since we are using SolJ, what we would need is something like boolean Term.isRepresentingCompleteFieldValue() Thanks, Alexander
Re: Performance of cross join vs block join
On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I used the term block join incorrectly: when I said block join I was referring to a join performed on a single core, versus a cross join performed across multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? Nope, SOLR-3076 has been waiting for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? Special indexing - yes. How would I need to index the data from my tables? In my case all the indices share a common schema anyway, since I am using dynamic fields, so I can easily add the documents from all tables into one Solr core, adding a discriminator field to each document. Correct, but the notion of a 'discriminator field' is a little different for block join. Could you point me to some more documentation? I can recommend only these: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, To me it's reasonable that a single-core join takes the same time as a cross-core one; I can't see what gain could be had in the former case. I can hardly comment on the join code; I've looked into it, and it's not trivial, to say the least. With block join there is no need to obtain parentId term values/numbers and look up parents by them; both of those actions are expensive. Block join also works as an iterator, whereas join needs to allocate memory for the parents bitset and populate it out of order, which hurts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, only hit the first one.
Another nice feature is 'both-side leapfrog': if you have a highly restrictive filter/query intersecting with BJQ, it can skip many parents and children as well, which is not possible in join, which has a fairly 'full-scan' nature. The main performance factor for join is the number of child docs. I'm not sure I got all your questions; please specify them in more detail if something is still unclear. Have you seen my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know of any performance measurements for cross joins compared to joins inside a single index? Is the join inside a single index that stores documents of all types (from the parent table and the child tables) with a discriminator field faster than the cross join (where each document type resides in its own index)? I have performed some tests, but it seems to me that a join in a single (bigger) index does not add much speed compared to cross joins. Why would a block join be faster than a cross join, if that is the case? What are the variables that count when trying to improve query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Search with punctuations
Hi, Scenario: users who perform a search forget to put a punctuation mark (apostrophe); for example, when a user wants to search for a value like INT'L, they just key in INTL (with no punctuation). In this scenario, I wish to return both values INTL and INT'L that are currently indexed on the SOLR instance. Currently, if I search for INTL it won't return the row having the value INT'L. Schema configuration entry for the field type:

<fieldType name="customStr" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s*[,.]\s*" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[';]" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s*[,.]\s*" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[';]" replacement="" replace="all"/>
  </analyzer>
</fieldType>

Please suggest what mechanism I should use to fetch both values, INTL and INT'L, when the search is performed for INTL. Also, do the regexes look correct for the analyzers? What different filters/tokenizers could be used to overcome this issue? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Search-with-punctuations-tp4077510.html Sent from the Solr - User mailing list archive at Nabble.com.
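As a sanity check, the filter chain above can be replayed with plain java.util.regex (a sketch that mirrors, rather than uses, Solr's PatternReplaceFilterFactory): both INT'L and INTL normalize to the same token, so with this field type in effect on both sides a query for INTL should match INT'L.

```java
public class PunctuationNormalizer {
    // Replays the analyzer chain from the schema above: keyword tokenizer
    // (a no-op on a single value), lowercase, trim, strip "," and ".",
    // collapse whitespace, then strip "'" and ";".
    static String normalize(String value) {
        return value.toLowerCase().trim()
                .replaceAll("\\s*[,.]\\s*", "")
                .replaceAll("\\s+", "")
                .replaceAll("[';]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("INT'L")); // intl
        System.out.println(normalize("INTL"));  // intl
    }
}
```

Since both the index-side and query-side chains produce `intl`, a mismatch in practice often means the existing documents were indexed before the field type was changed; reindexing, or checking the chain on the admin Analysis page, is worth trying first.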
Re: How to set a condition on the number of docs found
Do you want to modify the Solr source code? Did you check this line in XMLWriter.java: writeAttr("numFound", Long.toString(numFound)); 2013/7/12 Matt Lieber mlie...@impetus.com Hello there, I would like to be able to know whether I got over a certain threshold of doc results, i.e. test (result.numFound > 10) -> true. Is there a way to do this? I can't seem to find how to do this (other than doing the test in the client app, which is not great). Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
Re: Solr Live Nodes not updating immediately
Hi, the tickTime in zookeeper was high. When I reduced it to 2000 ms, the solr node status got updated within 20 s, which resolved my issue. Thanks for helping me. I have one more question: 1. Is it advisable to reduce the tickTime further? 2. What is the most appropriate tickTime that gives maximum performance while still updating the solr node status quickly? I have included my zoo.cfg configuration: tickTime=2000 dataDir=/home/local/ranjith-1785/sources/solrcloud/zookeeper-3.4.5_Server1/zoodata clientPort = 2181 initLimit=5 syncLimit=2 maxClientCnxns=180 server.1=localhost:2888:3888 server.2=localhost:3000:4000 server.3=localhost:2500:3500 -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Live-Nodes-not-updating-immediately-tp4076560p4077467.html Sent from the Solr - User mailing list archive at Nabble.com.
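For context on why a large tickTime slows down live-node updates: by default ZooKeeper negotiates client session timeouts to between 2x and 20x the tickTime, so a dead node is only noticed after a multiple of the tick. A small calculation, assuming those default bounds (both are overridable in zoo.cfg):

```java
public class TickTimeBounds {
    // Default ZooKeeper bounds: minSessionTimeout = 2 * tickTime,
    // maxSessionTimeout = 20 * tickTime.
    static int minSessionTimeoutMs(int tickTimeMs) { return 2 * tickTimeMs; }
    static int maxSessionTimeoutMs(int tickTimeMs) { return 20 * tickTimeMs; }

    public static void main(String[] args) {
        int tick = 2000; // the value from the zoo.cfg above
        System.out.println(minSessionTimeoutMs(tick) + " ms .. " + maxSessionTimeoutMs(tick) + " ms");
    }
}
```

So with tickTime=2000, a client that negotiated the maximum can take up to 40 s to be declared dead, which lines up with the ~20 s update latency reported above; shrinking tickTime further trades faster detection for more heartbeat traffic and more spurious session expirations.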
Custom processing in Solr Request Handler plugin and its debugging ?
Hi, I have defined my new Solr RequestHandler plugin like this in solrconfig.xml: <requestHandler name="/myendpoint" class="com.abc.MyRequestPlugin"/> And it's working fine. Now I want to do some custom processing from this plugin by making a search query to the regular '/select' handler: <requestHandler name="/select" class="solr.SearchHandler"/> and then receive the results back from the '/select' handler, perform some custom processing on those results, and send the response back from my custom /myendpoint handler. For this I need help on how to make a call to the '/select' handler from within the MyRequestPlugin class and perform some calculation on the results. I also need some help on how to debug my plugin. Since its .jar is deployed to solr_home/lib, how can I attach my plugin's code in Eclipse to the Solr process so I can debug it when a user sends a request to my plugin? Thanks, Tony
SolrCloud group.query error shard X did not set sort field values or how i can set fillFields=true on IndexSearcher.search
Hi! To reproduce the problem, do the following: 1. Start node1 of SolrCloud (4.3.1, default configs) (java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -jar start.jar) 2. Import some data into collection1 - shard1 3. Try a group.query, e.g. http://node1:8983/solr/collection1/select?q=*:*&group=true&group.query=someFiled:someValue — it is important to have a hit on the indexed data. 4. The result is fine; there is no error. 5. Start node2 of SolrCloud (java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar) 6. On node2 add a new core for collection1 - shard2, and unload the default core collection1. We now have one collection over two shards: shard1 has data, shard2 has none. 7. Again try the group.query http://node1:8983/solr/collection1/select?q=*:*&group=true&group.query=someFiled:someValue 8. Error: shard 0 did not set sort field values (FieldDoc.fields is null); you must pass fillFields=true to IndexSearcher.search on each shard. How can I set fillFields=true on IndexSearcher.search? Thanks in advance, Evgeny
How to optimize a search?
Hello folks, I'm doing a search for a specific phrase (Rocket Banana) in a specific field, and the document with the title Rocket Banana (Single) never comes first... and this is the result that should appear in first position. I've tried many ways to perform this search: title:"Rocket Banana" title:(Rocket AND Banana) title:(Rocket OR Banana) title:(Rocket^0.175 AND Banana^0.175) title:(Rocket^0.175 OR Banana^0.175) The order returned is basically:

<doc><float name="score">12.106901</float><str name="title">Rocket Rocket</str></doc>
<doc><float name="score">12.007204</float><str name="title">Rocket</str></doc>
<doc><float name="score">12.007203</float><str name="title">Banana Banana Banana</str></doc>
... a lot of results ...
<doc><float name="score">10.398543</float><str name="title">Rocket Banana (Single)</str></doc>

How can I optimize my search so that the document containing the full phrase I searched for gets a higher score than the others? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-optimize-a-search-tp4077531.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to optimize a search?
_Why_ should Rocket Banana (Single) come first? Essentially you have some ordering in mind and unless you can express it clearly you'll _never_ get ideal ranking. Really. But your particular issue can probably be solved by adding a clause like OR rocket banana^5 And I suspect you haven't given us the entire query, or you're running through edismax or whatever. In future, please paste the result of adding debug=all to the e-mail. Best Erick On Fri, Jul 12, 2013 at 7:32 AM, padcoe davidpadi...@gmail.com wrote: Hello folks, I'm doing a search for a specific word (Rocket Banana) in a specific field and the document with the result Rocket Banana (Single) never comes first..and this is the result that should appear in first position...i've tried to many ways to perform this search: title:Rocket Banana title:(Rocket AND Banana) title:(Rocket OR Banana) title:(Rocket^0.175 AND Banana^0.175) title:(Rocket^0.175 ORBanana^0.175) The order returned is basically like: docfloat name=score12.106901/floatstr name=titleRocket Rocket/str/doc docfloat name=score12.007204/floatstr name=titleRocket/str/doc docfloat name=score12.007203/floatstr name=titleBanana Banana Banana/str/doc a lot of results docfloat name=score10.398543/floatstr name=titleRocket Banana (Single)/str/doc How can i optimize my search and return the document that have the full word that i've searched with a higher scores then others? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-optimize-a-search-tp4077531.html Sent from the Solr - User mailing list archive at Nabble.com.
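The suggestion in the reply above — keep the boolean query but add a boosted phrase clause — can be sketched as simple query-string assembly (the field name `title` and the boost of 5 are taken from the thread; adjust to taste):

```java
public class BoostedQuery {
    // Combines a conjunction over the individual words with a phrase
    // clause boosted high enough that exact-phrase matches like
    // "Rocket Banana (Single)" outrank titles that merely repeat one word.
    static String withPhraseBoost(String field, String w1, String w2, int boost) {
        return field + ":(" + w1 + " AND " + w2 + ")"
                + " OR " + field + ":\"" + w1 + " " + w2 + "\"^" + boost;
    }

    public static void main(String[] args) {
        System.out.println(withPhraseBoost("title", "Rocket", "Banana", 5));
        // title:(Rocket AND Banana) OR title:"Rocket Banana"^5
    }
}
```

With edismax the same effect is usually achieved declaratively via the pf (phrase fields) parameter instead of hand-building the clause.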
Patch review request: SOLR-5001 (adding book links to the website)
Hello, As per earlier email thread, I have created a patch for Solr website to incorporate links to my new book. It would be nice if somebody with commit rights for the (markdown) website could look at it before the book's Solr version (4.3.1) stops being the latest :-) I promise to help with the new Wiki/Guide later in return. https://issues.apache.org/jira/browse/SOLR-5001 Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Does Solrj Batch Processing Querying May Confuse?
I've crawled some webpages and indexed them in Solr. I query the data from Solr via SolrJ. url is my unique field and I've defined my query like this: ModifiableSolrParams params = new ModifiableSolrParams(); params.set("q", "lang:tr"); params.set("fl", "url"); params.set("sort", "url desc"); I run my program to query 1000 rows per request and write them to a file. However, I realized that some documents that are indexed in Solr (I can query them from the admin page, just not via SolrJ as part of the 1000-row batch process) are not in my file. What may be the problem?
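One plausible cause (an assumption, since the message doesn't say whether indexing continued during the export): start/rows paging over an index that changes between requests shifts the page offsets, so documents can be silently skipped or duplicated. A plain-Java simulation of the skip case:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class PagingPitfall {
    // Pages through a descending-sorted "index" with start/rows while a
    // document disappears between page fetches; the offsets shift and
    // "url-d" is never returned, even though it was always in the index.
    static List<String> pageAll(TreeSet<String> index, int rows) {
        List<String> seen = new ArrayList<>();
        for (int start = 0; start < 6; start += rows) {
            List<String> snapshot = new ArrayList<>(index);
            for (int i = start; i < Math.min(start + rows, snapshot.size()); i++) {
                seen.add(snapshot.get(i));
            }
            index.remove("url-f"); // a doc is deleted (or re-sorted) mid-scan
        }
        return seen;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(Comparator.reverseOrder()); // url desc
        for (char c = 'a'; c <= 'f'; c++) index.add("url-" + c);
        List<String> seen = pageAll(index, 2);
        System.out.println(seen);                   // url-d is missing
        System.out.println(seen.contains("url-d")); // false
    }
}
```

Pausing commits for the duration of the export, or using cursor-based deep paging where the Solr version supports it, avoids this class of problem.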
Re: Problem using Term Component in solr
bq: Note:Term Component works only on string dataType field. :( Not true. Term Component will work on any indexed field. It'll bring back the _tokens_ that have been indexed though, which are often individual words so your examples medical physics would be two separate tokens so it may be puzzling. A general request, please don't put bold text. I know it's an attempt to help direct attention to the important bits, but (at least in gmail in my browser) bolds are replaced by * before and after, which especially when looking at wildcard questions is really confusing G. But I have to ask you to back up a bit. _Why_ are you using TermsComponent to search titles? Why not use Solr for what it's good for and just search a _tokenized_ title field? This feels like an XY problem. Best Erick On Thu, Jul 11, 2013 at 2:55 AM, Parul Gupta(Knimbus) parulgp...@gmail.com wrote: Hi All I am using *Term component* in solr for searching titles with short form using wild card characters(.*) and [a-z0-9]*. I am using *Term Component* specifically as wild card characters are not working on *select?q=* query search. Examples of some *title *are: 1)Medicine, Health Care and Philosophy 2)Medical Physics 3)Physics of fluids 4)Medical Engineering and Physics ***When i do *solr query*: localhost:8080/solr3.6/OA/terms?terms.fl=titleterms.regex=phy.* fluidsterms.regex.flag=case_insensitiveterms.limit=10 *Output* is 3rd title: *Physics of fluids* This is relevant output. ***But when i do *solr query*: localhost:8080/solr3.6/OA/terms?terms.fl=titleterms.regex=med.* phy.*terms.regex.flag=case_insensitiveterms.limit=10 *Output* are 2nd and 4th title: *Medical Engineering and Physics* *Medical Physics* This is irrelevant.I want only one result for this query i.e. *Medical Physics* *Although i have changed my wild card characters to *[a-z0-9]** instead of *.** ,but than first query doesn't work as '*of*' is included in '*Physics of fluids*'.However Second query works fine . 
An example query is: localhost:8080/solr3.6/OA/terms?terms.fl=title&terms.regex=med[a-z0-9]* phy[a-z0-9]*&terms.regex.flag=case_insensitive&terms.limit=10 This works fine and gives one output, *Medical Physics*. If there is another way of searching, with or without the *Term Component*, please suggest how to ignore such stop words. Note: Term Component works only on string dataType fields. :( -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200.html Sent from the Solr - User mailing list archive at Nabble.com.
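The difference between the two patterns in this thread comes down to `.*` matching across spaces while `[a-z0-9]*` stops at the first space. This is easy to check with plain java.util.regex, the same regex dialect terms.regex uses:

```java
import java.util.regex.Pattern;

public class TermsRegexDemo {
    public static void main(String[] args) {
        Pattern dotStar = Pattern.compile("med.* phy.*", Pattern.CASE_INSENSITIVE);
        Pattern bounded = Pattern.compile("med[a-z0-9]* phy[a-z0-9]*", Pattern.CASE_INSENSITIVE);

        String wanted   = "Medical Physics";
        String unwanted = "Medical Engineering and Physics";

        System.out.println(dotStar.matcher(wanted).matches());   // true
        System.out.println(dotStar.matcher(unwanted).matches()); // true: ".*" swallows " Engineering and"
        System.out.println(bounded.matcher(wanted).matches());   // true
        System.out.println(bounded.matcher(unwanted).matches()); // false: "[a-z0-9]*" cannot cross a space
    }
}
```

So `[a-z0-9]*` gives the precise word-prefix behavior wanted here; the residual problem with titles like "Physics of fluids" is the intervening stop word, which a regex alternative such as an optional `(of )?` group could absorb, at the cost of hand-maintaining the stop-word list.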
Re: Solr caching clarifications
Inline On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, As a result of frequent java OOM exceptions, I try to investigate more into the solr jvm memory heap usage. Please correct me if I am mistaking, this is my understanding of usages for the heap (per replica on a solr instance): 1. Buffers for indexing - bounded by ramBufferSize 2. Solr caches 3. Segment merge 4. Miscellaneous- buffers for Tlogs, servlet overhead etc. Particularly I'm concerned by Solr caches and segment merges. 1. How much memory consuming (bytes per doc) are FilterCaches (bitDocSet) and queryResultCaches (DocList)? I understand it is related to the skip spaces between doc id's that match (so it's not saved as a bitmap). But basically, is every id saved as a java int? Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you can get the maxDoc number from your Solr admin page). Plus some overhead for storing the fq text, but that's usually not much. This is for each entry up to Size. queryResultCache is usually trivial unless you've configured it extravagantly. It's the query string length + queryResultWindowSize integers per entry (queryResultWindowSize is from solrconfig.xml). 2. QueryResultMaxDocsCached - (for example = 100) means that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache? It's just a limit on the queryResultCache entry size as far as I can tell. But again this cache is relatively small, I'd be surprised if it used significant resources. 3. DocumentCache - written on the wiki it should be greater than max_results*concurrent_queries. Max result is just the num of rows displayed (rows-start) param, right? Not the queryResultWindow. Yes. This a cache (I think) for the _contents_ of the documents you'll be returning to be manipulated by various components during the life of the query. 4. 
LazyFieldLoading=true - when querying for ids only (fl=id), will this cache be used? (at the expense of evicting docs that were already loaded with stored fields) Not sure, but I don't think this will contribute much to memory pressure. This is about how many fields are loaded to get a single value from a doc in the results list, and since one is usually working with 20 or so docs this is usually a small amount of memory. 5. How large is the heap used by merges? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos, *.doc etc., half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge? Again, I don't think this is much of a memory consumer, although I confess I don't know the internals. Merging is mostly about I/O. Thanks in advance, Manu But take a look at the admin page; you can see how much memory the various caches are using by looking at the plugins/stats section. Best Erick
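Erick's filterCache estimate above turns into a quick back-of-the-envelope calculation (a sketch; real entries also carry some overhead for the fq text, which he notes is usually small):

```java
public class FilterCacheEstimate {
    // Each filterCache entry is essentially a bitset over all documents:
    // maxDoc / 8 bytes. The cache total is that times the configured size.
    static long bytesPerEntry(long maxDoc) { return maxDoc / 8; }

    static long totalBytes(long maxDoc, int cacheSize) {
        return bytesPerEntry(maxDoc) * cacheSize;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L; // hypothetical index size; read yours off the admin page
        System.out.println(bytesPerEntry(maxDoc));   // 1250000 (~1.2 MB per entry)
        System.out.println(totalBytes(maxDoc, 512)); // 640000000 (~640 MB at size=512)
    }
}
```

This is why a generously sized filterCache on a large index is one of the first places to look when chasing OOMs like the ones described above.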
RE: What happens in indexing request in solr cloud if Zookeepers are all dead?
Thanks very much for your clear explanation! -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, July 11, 2013 1:55 PM To: solr-user@lucene.apache.org Subject: Re: What happens in indexing request in solr cloud if Zookeepers are all dead? Sorry, no updates if no Zookeepers. There would be no way to assure that any node knows the proper configuration. Queries are a little safer using most recent configuration without zookeeper, but update consistency requires accurate configuration information. -- Jack Krupansky -Original Message- From: Zhang, Lisheng Sent: Thursday, July 11, 2013 2:59 PM To: solr-user@lucene.apache.org Subject: RE: What happens in indexing request in solr cloud if Zookeepers are all dead? Yes, I should not have used word master/slave for solr cloud! So if all Zookeepers are dead, could indexing requests be handled properly (could solr remember the setting for indexing)? Thanks very much for helps, Lisheng -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, July 11, 2013 10:46 AM To: solr-user@lucene.apache.org Subject: Re: What happens in indexing request in solr cloud if Zookeepers are all dead? There are no masters or slaves in SolrCloud - it is fully distributed and master-free. Leaders are temporary and can vary over time. The basic idea for quorum is to prevent split brain - two (or more) distinct sets of nodes (zookeeper nodes, that is) each thinking they constitute the authoritative source for access to configuration information. The trick is to require (N/2)+1 nodes for quorum. For n=3, quorum would be (3/2)+1 = 1+1 = 2, so one node can be down. For n=1, quorum = (1/2)+1 = 0 + 1 = 1. For n=2, quorum would be (2/2)+1 = 1 + 1 = 2, so no nodes can be down. IOW, for n=2 no nodes can be down for the cluster to do updates. 
-- Jack Krupansky -Original Message- From: Zhang, Lisheng Sent: Thursday, July 11, 2013 9:28 AM To: solr-user@lucene.apache.org Subject: What happens in indexing request in solr cloud if Zookeepers are all dead? Hi, In solr cloud latest doc, it mentioned that if all Zookeepers are dead, distributed query still works because solr remembers the cluster state. How about the indexing request handling if all Zookeepers are dead, does solr needs Zookeeper to know which box is master and which is slave for indexing to work? Could solr remember master/slave relations without Zookeeper? Also doc said Zookeeper quorum needs to have a majority rule so that we must have 3 Zookeepers to handle the case one instance is crashed, what would happen if we have two instances in quorum and one instance is crashed (or quorum having 3 instances but two of them are crashed)? I felt the last one should take over? Thanks very much for helps, Lisheng
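The quorum arithmetic in the reply above is easy to mechanize:

```java
public class ZkQuorum {
    // Quorum for an ensemble of n ZooKeeper nodes: (n/2) + 1 with integer
    // division, exactly as derived in the message above.
    static int quorum(int n) { return n / 2 + 1; }

    // How many nodes may fail while updates remain possible.
    static int tolerableFailures(int n) { return n - quorum(n); }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println("n=" + n + " quorum=" + quorum(n)
                    + " canLose=" + tolerableFailures(n));
        }
        // Note that n=2 tolerates no failures: for writes it is no more
        // resilient than a single node, which is why odd sizes are used.
    }
}
```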
Re: How to boost relevance based on distance and age..
the first thing I'd try would be FunctionQueries, see: http://wiki.apache.org/solr/FunctionQuery. Be a little careful. You have disjoint conditions, i.e. one or the other should be used, so you'll have two function queries, basically expressing if (age < 20 years) and if (age >= 20 years). The one that _doesn't_ apply should return 1, not 0, since it'll be multiplied by the score. Best Erick On Thu, Jul 11, 2013 at 11:03 AM, Vineel vine...@visionsoft-inc.com wrote: Here is the structure of the solr document: <doc><str name="latlong">52.401790,4.936660</str><date name="dateOfBirth">1993-12-09T00:00:00Z</date></doc> I would like to search for documents based on the following weighted criteria: - distance 0-10 miles: weight 40 - distance 10 miles and above: weight 20 - age 0-20 years: weight 20 - age 20 years and above: weight 10 Wondering what the recommended approaches are to build SOLR queries for this? Thanks -Vineel -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-boost-relevance-based-on-distance-and-age-tp4077330.html Sent from the Solr - User mailing list archive at Nabble.com.
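The disjoint weighting from the question can be modeled as pairs of piecewise functions where the branch that does not apply returns 1, as Erick advises, so multiplying them into the score is safe. A plain-Java sketch of the intended weights (thresholds and weights are the question's own; the translation into actual bf/boost function-query syntax is left to the reader):

```java
public class AgeDistanceBoost {
    // Weights from the question: distance < 10 miles -> 40, else 20;
    // age < 20 years -> 20, else 10. Each pair of functions covers one
    // criterion; the inactive branch contributes 1 so the product with
    // the relevance score is unchanged by it.
    static double nearBoost(double miles)  { return miles < 10 ? 40 : 1; }
    static double farBoost(double miles)   { return miles >= 10 ? 20 : 1; }
    static double youngBoost(double years) { return years < 20 ? 20 : 1; }
    static double oldBoost(double years)   { return years >= 20 ? 10 : 1; }

    static double totalBoost(double miles, double years) {
        return nearBoost(miles) * farBoost(miles) * youngBoost(years) * oldBoost(years);
    }

    public static void main(String[] args) {
        System.out.println(totalBoost(5, 18));  // 800.0 (near and young: 40 * 20)
        System.out.println(totalBoost(5, 25));  // 400.0 (near, older:   40 * 10)
        System.out.println(totalBoost(15, 25)); // 200.0 (far, older:    20 * 10)
    }
}
```

Had the inactive branches returned 0 instead of 1, every product would collapse to 0, which is exactly the pitfall the reply warns about.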
RE: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper
Sorry, I might not have asked clearly: our issue is that we have a few thousand collections (it could be many more), so running that command per collection is rather tedious. Is there a simpler way (all collections share the same schema/config)? Thanks very much for helps, Lisheng -Original Message- From: Furkan KAMACI [mailto:furkankam...@gmail.com] Sent: Friday, July 12, 2013 1:17 AM To: solr-user@lucene.apache.org Subject: Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper If you have one collection you just need to define the hostnames of the Zookeeper ensemble and run that command once. 2013/7/11 Zhang, Lisheng lisheng.zh...@broadvision.com Hi, We are testing solr 4.3.0 in Tomcat (considering upgrading solr 3.6.1 to 4.3.0). In the WIKI page for SolrCloud in Tomcat: http://wiki.apache.org/solr/SolrCloudTomcat we need to link each collection explicitly: /// 8) Link uploaded config with target collection java -classpath .:/home/myuser/solr-war-lib/* org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection mycollection -confname ... /// But our application has many cores (a few thousand, which all share the same schema/config); is there a more convenient way? Thanks very much for helps, Lisheng
Re: Request to be added to the ContributorsGroup
Done, at least for the Solr contributors group; if you want Lucene too, let me know. Added exactly as KumarLImbu; I don't know whether (1) both the L and the I should be capitalized, or (2) the rights-checking cares. Thanks! Erick On Fri, Jul 12, 2013 at 2:51 AM, Kumar Limbu kumarli...@gmail.com wrote: Hi, My username is KumarLImbu and I would like to be added to the Contributors Group. Could somebody please help me? Best Regards, Kumar
Re: Leader Election, when?
This is probably not all that important to worry about. The additional duties of a leader are pretty minimal, and the leaders will shift around anyway as you restart servers etc. It really feels like a premature optimization. Best, Erick On Thu, Jul 11, 2013 at 3:53 PM, aabreur alexandre.ab...@vtex.com.br wrote: I have a working Zookeeper ensemble running with 3 instances and also a SolrCloud cluster with some Solr instances. I've created a collection with settings for 2 shards. Then I: create 1 core on instance1, create 1 core on instance2, create 1 core on instance1, create 1 core on instance2, just to have this configuration: instance1: shard1_leader, shard2_replica; instance2: shard1_replica, shard2_leader. If I add 2 cores to instance1 then 2 cores to instance2, both leaders will be on instance1 and no re-election is done: instance1: shard1_leader, shard2_leader; instance2: shard1_replica, shard2_replica. Back to my ideal scenario (detached leaders): also, when I add a third instance with 2 replicas and kill one of my instances running a leader, the election picks the instance that already has a leader. My question is why Zookeeper behaves this way. Shouldn't it distribute leaders? If I deliver some stress to a double-leader instance, is Zookeeper going to run an election? -- View this message in context: http://lucene.472066.n3.nabble.com/Leader-Election-when-tp4077381.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Live Nodes not updating immediately
On 7/11/2013 11:11 PM, Ranjith Venkatesan wrote: The tickTime in zookeeper was high. When I reduced it to 2000 ms, Solr node status gets updated in 20 s. That resolved my issue; thanks for helping me. I have one more question: 1. Is it advisable to reduce the tickTime further? 2. What is the most appropriate tickTime that gives maximum performance while still updating Solr node status quickly? I have included my zoo.cfg configuration:
tickTime=2000
dataDir=/home/local/ranjith-1785/sources/solrcloud/zookeeper-3.4.5_Server1/zoodata
clientPort=2181
initLimit=5
syncLimit=2
maxClientCnxns=180
server.1=localhost:2888:3888
server.2=localhost:3000:4000
server.3=localhost:2500:3500
Here's mine, comments removed. Except for dataDir, these are all default values found in the zookeeper download and on the zookeeper website:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=zoodata
clientPort=2181
server.1=zoo1.REDACTED.com:2888:3888
server.2=zoo2.REDACTED.com:2888:3888
server.3=zoo3.REDACTED.com:2888:3888
http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_RunningReplicatedZooKeeper I hope your config is a dev install, because if all your zookeepers are running on the same server, you have no redundancy in the face of a server failure. Servers do fail, even if they have all the redundancy features you can buy. Thanks, Shawn
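A likely explanation for why tickTime changed detection speed (a sketch; the ZooKeeper server clamps every client's session timeout into the range [2 x tickTime, 20 x tickTime], and the 15000 ms zkClientTimeout default for Solr 4.x is an assumption here):

```java
public class ZkSessionBounds {
    // ZooKeeper negotiates each client's requested session timeout into
    // [2 * tickTime, 20 * tickTime] (server-side defaults). Solr's ephemeral
    // live-node entries only disappear after the session expires, so a large
    // tickTime silently inflates the timeout and delays node-status updates.
    public static long negotiate(long requestedMs, long tickTimeMs) {
        long min = 2 * tickTimeMs;
        long max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // Assuming Solr requests a 15000 ms session timeout:
        System.out.println(negotiate(15000, 2000));  // honored: 15000
        System.out.println(negotiate(15000, 30000)); // clamped up to 60000
    }
}
```

This is why reducing tickTime further buys little once the requested timeout falls inside the negotiable range: the session timeout, not tickTime itself, bounds detection latency.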
Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper
On 7/12/2013 7:29 AM, Zhang, Lisheng wrote: Sorry I might not have asked clearly, our issue is that we have a few thousand collections (can be much more), so running that command is rather tedius, is there a simpler way (all collections share same schema/config)? When you create each collection with the Collections API (http calls), you tell it the name of a config set stored in zookeeper. You can give all your collections the same config set if you like. If you manually create collections with the CoreAdmin API instead, you must use the zkcli script included in Solr to link the collection to the config set, which can be done either before or after the collection is created. The zkcli script provides some automation for the java command that you were given by Furkan. Thanks, Shawn
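Shawn's first option is easy to script for thousands of collections (a plain-Java sketch; the host and names are hypothetical, and collection.configName is the Collections API parameter that points a new collection at an already-uploaded config set, so no per-collection linkconfig run is needed):

```java
public class CreateCollections {
    // Build a Collections API CREATE call that reuses one shared config set.
    public static String createUrl(String host, String collection, String configSet) {
        return "http://" + host + "/solr/admin/collections?action=CREATE"
                + "&name=" + collection
                + "&numShards=1"
                + "&collection.configName=" + configSet;
    }

    public static void main(String[] args) {
        // Hypothetical: three collections, all sharing the "sharedconf" set.
        for (int i = 1; i <= 3; i++) {
            System.out.println(createUrl("localhost:8983", "coll" + i, "sharedconf"));
        }
    }
}
```

Issuing each URL (e.g. with any HTTP client) creates the collection already linked, replacing the per-collection zkcli linkconfig step.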
Re: How to set a condition on the number of docs found
Hmmm. One way is: http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.field=id&facet.offset=10&rows=0&facet.limit=1 If you get a facet result back, you have more than 10 results. Another way is to just look at it with a facet.query and have your app deal with it: http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.query={!lucene%20key=numberofresults}state:CO&rows=0 On Thu, Jul 11, 2013 at 11:45 PM, Matt Lieber mlie...@impetus.com wrote: Hello there, I would like to be able to know whether I got over a certain threshold of doc results, i.e. test (Result.numFound > 10) -> true. Is there a way to do this? I can't seem to find how (other than doing this test in the client app, which is not great). Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference. -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: How to set a condition on the number of docs found
Test where? I mean, numFound is right there at the top of the query results, right? Unfortunately there is no function query value source equivalent to numFound. There is numdocs, but that is the total number of documents in the index. There is also docfreq(term), which could be used in a function query (including the fl parameter) if you know a term that has a 1-to-1 relationship to your query results. It is worth filing a Jira to add numfound() as a function query value source. -- Jack Krupansky -----Original Message----- From: Matt Lieber Sent: Friday, July 12, 2013 1:45 AM To: solr-user@lucene.apache.org Subject: How to set a condition on the number of docs found Hello there, I would like to be able to know whether I got over a certain threshold of doc results, i.e. test (Result.numFound > 10) -> true. Is there a way to do this? I can't seem to find how (other than doing this test in the client app, which is not great). Thanks, Matt
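As Jack notes, numFound sits in every response header, so the cheapest threshold test is a rows=0 query plus a client-side comparison. A minimal sketch (the JSON snippet is hand-written for illustration, not captured from a real server):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumFoundCheck {
    // Pull numFound out of a Solr JSON response body. A real client would
    // use SolrJ's QueryResponse.getResults().getNumFound() instead of regex;
    // this keeps the sketch dependency-free.
    public static long numFound(String jsonResponse) {
        Matcher m = Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)").matcher(jsonResponse);
        if (!m.find()) throw new IllegalArgumentException("no numFound in response");
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        // Hand-written example response for a rows=0 query:
        String response = "{\"response\":{\"numFound\":42,\"start\":0,\"docs\":[]}}";
        System.out.println(numFound(response) > 10); // true
    }
}
```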
Re: Norms
Thanks. Yeah, I don't really want the queryNorm. On Wed, Jul 10, 2013 at 2:39 AM, Daniel Collins danwcoll...@gmail.com wrote: I don't know the full answer to your question, but here's what I can offer. Solr offers 2 types of normalisation, fieldNorm and queryNorm. FieldNorm is, as the name suggests, field-level normalisation based on the length of the field, and can be controlled by the omitNorms parameter on the field. In your example, fieldNorm is always 1.0 (see below), so that suggests you have correctly turned off field normalisation on the name_edgy field. 1.0 = fieldNorm(field=name_edgy, doc=231378) QueryNorm is what I'm still trying to get to the bottom of exactly :) But it's something that tries to normalise the results of different term queries so they are broadly comparable. You haven't supplied the query you ran, but based on the qf and bf, I'm assuming it breaks down into a DisMax query on 3 fields (name_edgy, name_edge, name_word), so queryNorm is trying to ensure that the results of those 3 queries can be compared. The exact details I'm still trying to get to the bottom of (any volunteers with more info, chip in!). From earlier answers to the list, queryNorm is calculated in the Similarity object; I need to dig further, but that's probably a good place to start. On 10 July 2013 04:57, William Bell billnb...@gmail.com wrote: I have a field that has omitNorms=true, but when I look at debugQuery I see that the field is being normalized for the score. What can I do to turn off normalization in the score? I want a simple way to do 2 things: boost geodist() highest at 1 mile and lowest at 100 miles, plus add a boost for query=edgefield^5. I only want tf() and no queryNorm. I am not even sure I want idf(), but I can probably live with rare names being boosted. The results are being normalized. See below. I tried dismax and edismax - bf, bq and boost.
<requestHandler name="autoproviderdist" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <float name="tie">0.01</float>
    <str name="fl">display_name,city_state,prov_url,pwid,city_state_alternative</str>
    <!-- <str name="bq">_val_:sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)^10</str> -->
    <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
    <int name="rows">5</int>
    <str name="q.alt">*:*</str>
    <str name="qf">name_edgy^.9 name_edge^.9 name_word</str>
    <str name="group">true</str>
    <str name="group.field">pwid</str>
    <str name="group.main">true</str>
    <!-- <str name="pf">name_edgy</str> do not turn on -->
    <str name="sort">score desc, last_name asc</str>
    <str name="d">100</str>
    <str name="pt">39.740112,-104.984856</str>
    <str name="sfield">store_geohash</str>
    <str name="hl">false</str>
    <str name="hl.fl">name_edgy</str>
    <str name="mm">2&lt;-1 4&lt;-2 6&lt;-3</str>
  </lst>
</requestHandler>

0.058555886 = queryNorm

product of:
  10.854807 = (MATCH) sum of:
    1.8391232 = (MATCH) max plus 0.01 times others of:
      1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378), product of:
        0.30982485 = queryWeight(name_edge:paul^0.9), product of:
          0.9 = boost
          5.8789964 = idf(docFreq=26567, maxDocs=3493655)
          *0.058555886 = queryNorm*
        5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378), product of:
          1.0 = tf(termFreq(name_edge:paul)=1)
          5.8789964 = idf(docFreq=26567, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edge, doc=231378)
      1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378), product of:
        0.30510724 = queryWeight(name_edgy:paul^0.9), product of:
          0.9 = boost
          5.789479 = idf(docFreq=29055, maxDocs=3493655)
          *0.058555886 = queryNorm*
        5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378), product of:
          1.0 = tf(termFreq(name_edgy:paul)=1)
          5.789479 = idf(docFreq=29055, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edgy, doc=231378)
    9.015684 = (MATCH) max plus 0.01 times others of:
      8.9352665 = (MATCH) weight(name_word:nutting in 231378), product of:
        0.72333425 = queryWeight(name_word:nutting), product of:
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          0.058555886 = queryNorm
        12.352887 = (MATCH) fieldWeight(name_word:nutting in 231378), product of:
          1.0 = tf(termFreq(name_word:nutting)=1)
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          1.0 = fieldNorm(field=name_word, doc=231378)
      8.04174 = (MATCH) weight(name_edgy:nutting^0.9 in 231378), product of:
        0.65100086 = queryWeight(name_edgy:nutting^0.9), product of:
          0.9 = boost
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          *0.058555886 = queryNorm*
        12.352887 = (MATCH) fieldWeight(name_edgy:nutting in 231378), product of:
          1.0 = tf(termFreq(name_edgy:nutting)=1)
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edgy, doc=231378)
  1.0855998 = sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))
-- Bill Bell billnb...@gmail.com cell 720-256-8076
-- Bill Bell billnb...@gmail.com cell 720-256-8076
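The explain numbers above can be checked by hand with the classic Lucene TF-IDF products; a small sketch (the geodist value is back-solved from the explain output and therefore hypothetical):

```java
public class ExplainArithmetic {
    // Lucene 3.x/4.x TF-IDF: queryWeight = boost * idf * queryNorm, and
    // fieldWeight = tf * idf * fieldNorm. With omitNorms, fieldNorm is 1.0,
    // but queryNorm still scales every clause - which is Bill's complaint.
    public static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    public static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double queryNorm = 0.058555886;
        // name_edge:paul^0.9 from the explain above:
        System.out.println(queryWeight(0.9, 5.8789964, queryNorm)); // ~0.30982485
        System.out.println(fieldWeight(1.0, 5.8789964, 1.0));       // 5.8789964
        // The boost function: sum(recip(geodist(...), .5, 6, 6), 0.1), where
        // recip(x, m, a, b) = a / (m * x + b). A distance of ~0.1753 km
        // (back-solved, hypothetical) reproduces the explain's 1.0855998:
        double dist = 0.1753;
        System.out.println(6.0 / (0.5 * dist + 6.0) + 0.1); // ~1.0856
    }
}
```

Dropping queryNorm (as Bill wants) removes only a constant factor per query, so it changes absolute scores but not the ranking within a single query.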
Re: Is it possible to find a leader from a list of cores in solr via java code
Hi, As per the suggestions above I shifted my focus to using CloudSolrServer. In terms of sending updates to the leaders and reducing network traffic it works great. But one problem I faced with CloudSolrServer is that it opens too many connections - as many as five thousand. My code is as follows:

ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 3);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 2);
HttpClient client = HttpClientUtil.createClient(params);
LBHttpSolrServer lbServer = new LBHttpSolrServer(client);
server = new CloudSolrServer(zkHost, lbServer);
server.setDefaultCollection(defaultColllection);

If there is only one instance of Solr up then this works great, but in a 1-shard, 1-replica system it opens too many connections in waiting state. Am I doing something incorrect? Any help would be highly appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-find-a-leader-from-a-list-of-cores-in-solr-via-java-code-tp4074994p4077587.html Sent from the Solr - User mailing list archive at Nabble.com.
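One common cause of this kind of connection growth (a guess, not a diagnosis of the code above) is constructing a new CloudSolrServer - and with it a new HttpClient pool - per request instead of reusing a single instance. A minimal sketch of the reuse pattern, with a stand-in class so it runs without SolrJ on the classpath:

```java
public class SolrClientHolder {
    // FakeClient stands in for CloudSolrServer so the sketch is
    // self-contained; the instance counter shows that only one client
    // (and so only one connection pool) is ever created.
    static class FakeClient {
        static int instances = 0;
        FakeClient(String zkHost) { instances++; }
    }

    private static volatile FakeClient client;

    // Lazily create one shared client per JVM (double-checked locking).
    public static FakeClient get(String zkHost) {
        if (client == null) {
            synchronized (SolrClientHolder.class) {
                if (client == null) client = new FakeClient(zkHost);
            }
        }
        return client;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) get("zk1:2181,zk2:2181"); // hypothetical ensemble
        System.out.println(FakeClient.instances); // 1
    }
}
```

CloudSolrServer is thread-safe for this kind of sharing, so one instance can serve all indexing threads.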
Re: What does too many merges...stalling in indexwriter log mean?
Thanks Shawn, Do you have any feeling for what gets traded off if we increase maxMergeCount? This is completely new for us because we are experimenting with indexing pages instead of whole documents. Since our average document is about 370 pages, this means that we have increased the number of documents we are asking Solr to index by a couple of orders of magnitude (on the other hand, the size of each document decreases by a couple of orders of magnitude). I'm not sure why increasing the number of documents (and reducing their size) is causing more merges. I'll have to investigate. Tom On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey s...@elyograg.org wrote: On 7/11/2013 1:47 PM, Tom Burton-West wrote: We are seeing the message too many merges...stalling in our indexwriter log. Is this something to be concerned about? Does it mean we need to tune something in our indexing configuration? It sounds like you've run into the maximum number of simultaneous merges, which I believe defaults to two, or maybe three. The following config section in indexConfig will likely take care of the issue. This assumes 3.6 or later; I believe that on older versions this goes in indexDefaults.

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxThreadCount">1</int>
  <int name="maxMergeCount">6</int>
</mergeScheduler>

Looking through the source code to confirm, this definitely seems like the case. Increasing maxMergeCount is likely going to speed up your indexing, at least by a little bit. A value of 6 is probably high enough for mere mortals, but you guys don't do anything small, so I won't begin to speculate what you'll need. If you are using spinning disks, you'll want maxThreadCount at 1. If you're using SSD, then you can likely increase that value. Thanks, Shawn
Multiple queries or Filtering Queries in Solr
My problem is that I have n fields (say around 10) in Solr that are searchable; they are all indexed and stored. I would like to run a query first on my whole index of, say, 5000 docs, which will hit around an average of 500 docs. Next I would like to query using a different set of keywords on these 500 docs and NOT on the whole index. So the first time I send a query a score will be generated; the second time I run a query, the new score generated should be based on the 500 documents of the previous query - in other words, Solr should consider only these 500 docs as the whole index. To summarise, an index of 5000 will be filtered to 500 and then to 50 (5000 -> 500 -> 50). It's basically filtering, but I would like to do this in Solr. I have reasonable basic knowledge and am still learning. Update: represented mathematically it would look like this: results1 = f(query1); results2 = f(query2, results1); final_results = f(query3, results2). I would like to accomplish this in a program; the end-user will only see the 50 results, so faceting is not an option. -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-queries-or-Filtering-Queries-in-Solr-tp4077574.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to set a condition over stats result
sum(x, y, z) = x + y + z (sums those specific field values for the current document); sum(x, y) = x + y (the sum of those two specific field values for the current document); sum(x) = field(x) = x (the specific field value for the current document). The sum function in function queries is not an aggregate function. Ditto for min and max. -- Jack Krupansky -----Original Message----- From: mihaela olteanu Sent: Friday, July 12, 2013 1:44 AM To: solr-user@lucene.apache.org Subject: Re: How to set a condition over stats result What if you perform sub(sum(myfieldvalue),100) > 0 using frange? From: Jack Krupansky j...@basetechnology.com To: solr-user@lucene.apache.org Sent: Friday, July 12, 2013 7:44 AM Subject: Re: How to set a condition over stats result None that I know of, short of writing a custom search component. Seriously, you could hack up a copy of the stats component with your own logic. Actually... this may be a case for the new, proposed Script Request Handler, which would let you execute a query and then do any custom JavaScript logic you wanted. When we get that feature, it might be interesting to implement a variation of the standard stats component as a JavaScript script, and then people could easily hack it such as in your request. Fascinating. -- Jack Krupansky -----Original Message----- From: Matt Lieber Sent: Thursday, July 11, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: How to set a condition over stats result Hello, I am trying to see how I can test the sum of values of an attribute across docs, i.e. whether sum(myfieldvalue) > 100. I know I can use the stats module, which computes the sum of my attribute on a certain facet, but how can I test this result (i.e. is sum > 100) within my stats query? From what I read, it's not supported yet to perform a function on the stats result. Any other way to do this? Cheers, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law.
Re: What does too many merges...stalling in indexwriter log mean?
On 7/12/2013 9:23 AM, Tom Burton-West wrote: Do you have any feeling for what gets traded off if we increase the maxMergeCount? This is completely new for us because we are experimenting with indexing pages instead of whole documents. Since our average document is about 370 pages, this means that we have increased the number of documents we are asking Solr to index by a couple of orders of magnitude. (on the other hand the size of the document decreases by a couple of orders of magnitude). I'm not sure why increasing the number of documents (and reducing their size) is causing more merges. I'll have to investigate. I'm not sure that you lose anything, really. If everything is proceeding normally before the stalling message is logged, I would not expect it to cause ANY problems. The reason that I increased this value was because when I did a full-import of millions of documents from mysql, I would reach the point where there were three different levels of merges going on at once. Because the default thread count is one, only the largest merge was actually occurring, the others were queued and waiting. With three merges stacked up at once, I had passed the maxMergeCount threshold, so *indexing* stopped. It can take several minutes for a very large merge to finish, so indexing stopped long enough that the MySQL server would drop the connection established by the JDBC driver. Once the merge finished and DIH tried to resume indexing, the connection was gone and it would fail the entire import. I have never seen more than three merge levels happening at once, so a value of 6 is probably overkill, but shouldn't be a problem. The true goal is to make sure that indexing never stops, not to push the system limits. The maxThreadCount parameter should prevent I/O from becoming a problem. Thanks, Shawn
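The level-stacking Shawn describes can be simulated with a toy model (this is not TieredMergePolicy's real logic, just the cascade effect that makes several merges pend at once when one flush tips over every level):

```java
import java.util.ArrayList;
import java.util.List;

public class MergeCascade {
    // Toy model: every flush adds one level-0 segment; whenever a level
    // reaches mergeFactor segments, they merge into one segment a level up.
    // The flush that creates the 1000th tiny segment (mergeFactor 10)
    // triggers a level-0 merge whose result is the 10th level-1 segment,
    // which triggers a level-1 merge, and so on - three merges at once.
    public static int maxSimultaneousMerges(int flushes, int mergeFactor) {
        List<Integer> segsPerLevel = new ArrayList<>();
        int maxPending = 0;
        for (int f = 0; f < flushes; f++) {
            int level = 0, pendingThisFlush = 0;
            while (true) {
                while (segsPerLevel.size() <= level) segsPerLevel.add(0);
                segsPerLevel.set(level, segsPerLevel.get(level) + 1);
                if (segsPerLevel.get(level) < mergeFactor) break;
                segsPerLevel.set(level, 0); // merge this whole level...
                pendingThisFlush++;         // ...into one segment a level up
                level++;
            }
            maxPending = Math.max(maxPending, pendingThisFlush);
        }
        return maxPending;
    }

    public static void main(String[] args) {
        System.out.println(maxSimultaneousMerges(100, 10));  // 2
        System.out.println(maxSimultaneousMerges(1000, 10)); // 3
    }
}
```

This also shows why many small documents flush more segments and therefore stack more merges than few large ones, and why a maxMergeCount above the deepest cascade keeps indexing from stalling.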
RE: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper
Thanks very much for all the helps! -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, July 12, 2013 7:31 AM To: solr-user@lucene.apache.org Subject: Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper On 7/12/2013 7:29 AM, Zhang, Lisheng wrote: Sorry I might not have asked clearly, our issue is that we have a few thousand collections (can be much more), so running that command is rather tedius, is there a simpler way (all collections share same schema/config)? When you create each collection with the Collections API (http calls), you tell it the name of a config set stored in zookeeper. You can give all your collections the same config set if you like. If you manually create collections with the CoreAdmin API instead, you must use the zkcli script included in Solr to link the collection to the config set, which can be done either before or after the collection is created. The zkcli script provides some automation for the java command that you were given by Furkan. Thanks, Shawn
Re: Patch review request: SOLR-5001 (adding book links to the website)
Hi Alexandre, I'll work on this today. Steve On Jul 12, 2013, at 8:26 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, As per earlier email thread, I have created a patch for Solr website to incorporate links to my new book. It would be nice if somebody with commit rights for the (markdown) website could look at it before the book's Solr version (4.3.1) stops being the latest :-) I promise to help with the new Wiki/Guide later in return. https://issues.apache.org/jira/browse/SOLR-5001 Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Performance of cross join vs block join
Hi Mikhail, I have commented on your blog, but it seems I did something wrong, as the comment is not there. Would it be possible to share the test setup (script)? I have found that the crucial thing with joins is the number of 'joins' [hits returned], and it seems that the experiments I have seen so far were geared towards small collections - even if Erick's index was 26M, the number of hits was probably small - you can see a very different story if you face some [other] real data. Here is a citation network, and I was comparing lucene joins [i.e. not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment]: https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png Notice the y axis is sqrt, so the running time for the lucene join is growing and growing very fast! It takes lucene 30s to do the search that selects 1M hits. The comparison is against our own implementation of a similar search - but the main point I am making is that join benchmarks should show the number of hits selected by the join operation. Otherwise, a very important detail is hidden. Best, roman On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I have used the term block join wrongly. When I said block join I was referring to a join performed on a single core, versus a cross join performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? nope, SOLR-3076 awaits for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? Special indexing - yes. How would I need to index the data from my tables?
In my case anyway all the indices have a common schema since I am using dynamic fields, thus I can easily add all documents from all tables in one Solr core, but for each document to add a discriminator field. correct. but notion of ' discriminator field' is a little bit different for blockjoin. Could you point me to some more documentation? I can recommend only those http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, For me it's reasonable that single core join takes the same time as cross core one. I just can't see which gain can be obtained from in the former case. I hardly able to comment join code, I looked into, it's not trivial, at least. With block join it doesn't need to obtain parentId term values/numbers and lookup parents by them. Both of these actions are expensive. Also blockjoin works as an iterator, but join need to allocate memory for parents bitset and populate it out of order that impacts scalability. Also in None scoring mode BJQ don't need to walk through all children, but only hits first. Also, nice feature is 'both side leapfrog' if you have a highly restrictive filter/query intersects with BJQ, it allows to skip many parents and children as well, that's not possible in Join, which has fairly 'full-scan' nature. Main performance factor for Join is number of child docs. I'm not sure I got all your questions, please specify them in more details, if something is still unclear. have you saw my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? 
On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know about some measurements in terms of performance for cross joins compared to joins inside a single index? Is it faster the join inside a single index that stores all documents of various types (from parent table or from children tables)with a discriminator field compared to the cross join (basically in this case each document type resides in its own index)? I have performed some tests but to me it seems that having a join in a single index (bigger index) does not add too much speed improvements compared to cross joins. Why a block join would be faster than a cross join if this is the case? What are the variables that count when trying to improve the query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer,
Re: Problem using Term Component in solr
Hi, OK, I will not use bold text in my queries. I guess my question was not clear to you. What I am doing is: I have a live source, say 'A', and a stored database, say 'B'. Both A and B have title fields in them. Consider A as non-persistent Solr and B as persistent Solr. I have to match titles coming from A against the database B. Some titles from live source A come in short form, e.g. 'med. phys.' and 'phys. fluids', but corresponding to these titles my database B has the titles 'medical physics' and 'physics of fluids'. Because of these differences, A is not able to find the corresponding titles in B using the 'tokenized' field 'title' with wildcards, hence I used the Terms component first, which gives me the corresponding matched title in B. When I get the full title like 'medical physics', I fetch it from the HTML and then search it again in the tokenized copy of 'title', say 'titlenew' (a copyField of title), which brings me the result 'medical physics'. But I am failing to match 'phys. fluids' with 'physics of fluids', as it has a stop word in it, using [a-z0-9]*. Hope now you get my issue and can help. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200p4077628.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: expunging deletes
OK, thanks Shawn. I went with this because 10 wasn't working for us, and it looks like my index is staying under 20 GB now with numDocs: 16897524 and maxDoc: 19048053.

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">5</int>
  <int name="segmentsPerTier">5</int>
  <int name="maxMergeAtOnceExplicit">15</int>
  <double name="maxMergedSegmentMB">6144.0</double>
  <double name="reclaimDeletesWeight">6.0</double>
</mergePolicy>

-----Original Message----- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Wednesday, July 10, 2013 5:34 PM To: solr-user@lucene.apache.org Subject: Re: expunging deletes On 7/10/2013 5:58 PM, Petersen, Robert wrote: Using solr 3.6.1 and the following settings, I am trying to run without optimizes. I used to optimize nightly, but sometimes the optimize took a very long time to complete and slowed down our indexing. We are continuously indexing our new or changed data all day and night. After a few days running without an optimize, the index size has nearly doubled and maxDoc is nearly twice the size of numDocs. I understand deletes should be expunged on merges, but even after trying lots of different settings for our merge policy it seems this growth is somewhat unbounded. I have tried sending an optimize with numSegments=2, which is a lot lighter weight than a regular optimize, and that does bring the number down, but not by much. Does anyone have any ideas for better settings for my merge policy that would help? Here is my current index snapshot too: Your merge settings are the equivalent of the old mergeFactor set to 35, and based on the fact that you have the Explicit set to 105, I'm guessing your settings originally came from something I posted - these are the numbers that I use. These settings can result in a very large number of segments on your disk. Because you index a lot (and probably reindex existing documents often), I can understand why you have high merge settings, but if you want to eliminate optimizes, you'll need to go lower.
The default merge setting of 10 (with an Explicit value of 30) is probably a good starting point, but you might need to go even smaller. On Solr 3.6, an optimize probably cannot take place at the same time as index updates -- the optimize would probably delay updates until after it's finished. I remember running into problems on Solr 3.x, so I set up my indexing program to stop updates while the index was optimizing. Solr 4.x should lift any restriction where optimizes and updates can't happen at the same time. With an index size of 25GB, a six-drive RAID10 should be able to optimize in 10-15 minutes, but if your I/O system is single disk, RAID1, RAID5, or RAID6, the write performance may cause this to take longer. If you went with SSD, optimizes would happen VERY fast. Thanks, Shawn
Re: How to set a condition on the number of docs found
Thanks William, I'll do that. Matt On 7/12/13 7:38 AM, William Bell billnb...@gmail.com wrote: Hmmm. One way is: http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.field=id&facet.offset=10&rows=0&facet.limit=1 If you get a facet result back, you have more than 10 results. Another way is to just look at it with a facet.query and have your app deal with it: http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.query={!lucene%20key=numberofresults}state:CO&rows=0 On Thu, Jul 11, 2013 at 11:45 PM, Matt Lieber mlie...@impetus.com wrote: Hello there, I would like to be able to know whether I got over a certain threshold of doc results, i.e. test (Result.numFound > 10) -> true. Is there a way to do this? I can't seem to find how (other than doing this test in the client app, which is not great). Thanks, Matt -- Bill Bell billnb...@gmail.com cell 720-256-8076
Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
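William's facet.offset trick can be sanity-checked offline: faceting on the unique id field gives one facet value per matching document, so a non-empty facet list at facet.offset=10, facet.limit=1 means numFound exceeded 10. A minimal simulation of that offset/limit slicing (plain Java, not Solr code; the class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FacetOffsetSketch {
    // Simulate Solr's facet.offset / facet.limit paging over an ordered list
    // of facet values: skip `offset` entries, then return up to `limit`.
    public static <T> List<T> facetSlice(List<T> values, int offset, int limit) {
        if (offset >= values.size()) {
            return Collections.emptyList();
        }
        return values.subList(offset, Math.min(offset + limit, values.size()));
    }

    public static void main(String[] args) {
        // Faceting on a unique id field yields one facet value per matching
        // doc, so a non-empty slice at offset=10, limit=1 means numFound > 10.
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            ids.add("doc" + i);
        }
        System.out.println(!facetSlice(ids, 10, 1).isEmpty()); // true: more than 10 docs
    }
}
```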
Save Solr index in database
hi I wanted to understand if it is possible to store/save Solr indexes in a database instead of on the filesystem. I checked out some articles where Lucene can do it, hence I assume Solr can too, but it's not clear to me how to configure Solr to save the indexes in the database instead of in the /index directory. Any help is really appreciated as I think I have hit a wall with this. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performance of cross join vs block join
Hello Roman, Thanks for your interest. I briefly looked at your approach, and I'm really interested in your numbers. Here is the trivial code; I'd rather rely on your testing framework, and can provide you a version of Solr 4.2 with SOLR-3076 applied. Do you need it? https://github.com/m-khl/join-tester What you are saying about benchmark representativeness definitely makes sense. I didn't try to establish a completely representative benchmark, just wanted to have rough numbers related to my use case. I'm from eCommerce, and that volume was enough for me. What I didn't get is 'not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment'. Usually there is no problem with blocks in a multi-segment index; a block definitely can't span across segments. Anyway, please elaborate. One of the block join benefits is the ability to hit only the first matched child in a group and jump over the following ones. It isn't applicable in general, but sometimes gives a huge gain. On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Mikhail, I have commented on your blog, but it seems I have done something wrong, as the comment is not there. Would it be possible to share the test setup (script)? I have found out that the crucial thing with joins is the number of 'joins' [hits returned] and it seems that the experiments I have seen so far were geared towards small collections - even if Erick's index was 26M, the number of hits was probably small - you can see a very different story if you face some [other] real data. Here is a citation network and I was comparing lucene joins [ie not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment]) https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png Notice, the y axis is sqrt-scaled, so the running time for the lucene join is growing very fast!
It takes Lucene 30s to do the search that selects 1M hits. The comparison is against our own implementation of a similar search - but the main point I am making is that the join benchmarks should be showing the number of hits selected by the join operation. Otherwise, a very important detail is hidden. Best, roman On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I have used the term block join wrongly. When I said block join I was referring to a join performed on a single core versus a cross join performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? Nope, SOLR-3076 has been awaiting for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? Special indexing - yes. How would I need to index the data from my tables? In my case anyway all the indices have a common schema since I am using dynamic fields, thus I can easily add all documents from all tables in one Solr core, but for each document to add a discriminator field. Correct, but the notion of 'discriminator field' is a little bit different for blockjoin. Could you point me to some more documentation? I can recommend only these: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, For me it's reasonable that single core join takes the same time as cross core one. I just can't see what gain could be obtained in the former case.
I'm hardly able to comment on the join code; I looked into it, and it's not trivial, at least. With block join it doesn't need to obtain parentId term values/numbers and look up parents by them. Both of these actions are expensive. Also blockjoin works as an iterator, but join needs to allocate memory for the parents bitset and populate it out of order, which impacts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, but only hits the first. Another nice feature is 'both side leapfrog': if you have a highly restrictive filter/query intersecting with BJQ, it allows skipping many parents and children as well, which is not possible in Join, which has a fairly 'full-scan' nature. The main performance factor for Join is the number of child docs. I'm not sure I got all your questions, please specify them in more detail.
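For readers wondering what "indexing into one segment"/block indexing means here: with the block-join approach (the SOLR-3076 patch, which later shipped in Solr), a parent and its children must be added contiguously as one block. In Solr releases that support it, the update XML and query look roughly like this sketch (field names and values are invented for illustration):

```xml
<add>
  <doc>
    <field name="id">parent-1</field>
    <field name="doc_type">parent</field>
    <!-- children are nested inside the parent and indexed in the same block,
         immediately before the parent document -->
    <doc>
      <field name="id">child-1</field>
      <field name="color">red</field>
    </doc>
    <doc>
      <field name="id">child-2</field>
      <field name="color">blue</field>
    </doc>
  </doc>
</add>
```

A query then selects parents by matching children, e.g. q={!parent which="doc_type:parent"}color:red.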
Re: Save Solr index in database
And why would you want to do that? Seems rather wrong direction to march in. I am assuming relational database. There is a commercial solution that integrates Solr into Cassandra, if I understood it correctly: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-solr Even then, there might be some stuff on the filesystem. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Jul 12, 2013 at 2:30 PM, sagarmj76 sagarm_jad...@yahoo.com wrote: hi I wanted to understand if it is possible to store/save Solr indexes to the database instead of the filesystem. I checked out some articles where lucene can do it. Hence I assume Solr can too but its not clear to me how to configure Solr to save the indexes in the database instead in the /index directory. Any help is really appreciated as I think I have hit a wall with this. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Save Solr index in database
On 7/12/2013 12:30 PM, sagarmj76 wrote: hi I wanted to understand if it is possible to store/save Solr indexes to the database instead of the filesystem. I checked out some articles where lucene can do it. Hence I assume Solr can too but its not clear to me how to configure Solr to save the indexes in the database instead in the /index directory. Any help is really appreciated as I think I have hit a wall with this. If Lucene can do it, then theoretically Solr can do so as well. You could very likely add jars to your classpath (to add a Directory and DirectoryFactory implementation that uses a database) and reference that class in the Solr config, but unless the class provided a way to configure itself, you probably wouldn't be able to specify its config within Solr's config without custom plugin code. A burning question ... why would you want to do this? Lucene and Solr are highly optimized to work well with a local filesystem. That is the path that will give you the best performance. Thanks, Shawn
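For completeness, the hook Shawn describes is the directoryFactory element in solrconfig.xml. The class below is hypothetical: you would have to write or obtain a database-backed DirectoryFactory implementation yourself, since nothing like it ships with Solr:

```xml
<!-- com.example.JdbcDirectoryFactory is a made-up class name; its jar would
     have to be on Solr's classpath (e.g. via a <lib/> directive) -->
<directoryFactory name="DirectoryFactory" class="com.example.JdbcDirectoryFactory"/>
```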
Re: Save Solr index in database
The reason for going that route is that our application is clustered, and if the indexing information is on the filesystem, I am not sure whether it would be replicated. At the same time, since it's a product it needs to be packaged with the product, and also for proprietary reasons we are not allowed to use the filesystem. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077662.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Save Solr index in database
On 13 July 2013 00:19, Shawn Heisey s...@elyograg.org wrote: On 7/12/2013 12:30 PM, sagarmj76 wrote: hi I wanted to understand if it is possible to store/save Solr indexes to the database instead of the filesystem. I checked out some articles where lucene can do it. Hence I assume Solr can too but its not clear to me how to configure Solr to save the indexes in the database instead in the /index directory. Any help is really appreciated as I think I have hit a wall with this. [...] As others have noted, think twice about why you would want to do this. Lucene does it through JdbcDirectory but as far as I know this is only an interface without a concrete implementation, though apparently third-party libraries are available that implement JdbcDirectory. The Lucene FAQ notes that this is slow: http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F Regards, Gora
Re: Save Solr index in database
On 7/12/2013 12:51 PM, Sagar Jadhav wrote: The reason for going that route is because our application is clustered and if the indexing information is on the filesystem, I am not sure whether that would be replicated. At the same time since its a product it needs to be packaged with the product and also from a proprietary reason we are not allowed to use the filesystem. Solr can do replication from a master server to slaves. If you implement as SolrCloud, then you would have a clustered solution with no master/slave designations. SolrCloud requires a three server minimum for a robust deployment. The third server can be a wimpy thing that only runs zookeeper. Putting your index in a DB is just a bad idea. It would be hard to find help with it, and performance would not be good. Thanks, Shawn
Re: Save Solr index in database
I think that makes a lot of sense as I was reading up on the SolrCloud technique. Thanks a lot Shawn for the validation. Thanks a lot everyone for helping me go in the right direction. I really appreciate all the inputs. I will now go back and get the exception for access to the filesystem. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077673.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Save Solr index in database
If they ask, tell them that Solr *is* a database. Databases store their stuff on a file system, so your data is gonna end up there in the end. Putting Solr indexes inside a database is like storing MySQL tables in Oracle. Upayavira On Fri, Jul 12, 2013, at 08:18 PM, Sagar Jadhav wrote: I think that makes a lot of sense as I was reading up on the SolrCloud technique. Thanks a lot Shawn for the validation. Thanks a lot everyone for helping me go in the right direction. I really appreciate all the inputs. I will now go back and get the exception for access to the filesystem. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077673.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem using Term Component in solr
Is the vocabulary known? That is, do you know the abbreviations that will be used? If so, you could consider synonyms, in which case you'd go to tokenized titles and use phrase queries to get your matches... Regexes often don't scale extremely well, although the 4.x FST implementations are much faster than they used to be. It seems to me that regularizing the titles is a better idea than trying to fake it with regexes, but you know your problem space better than me... Best Erick On Fri, Jul 12, 2013 at 1:32 PM, Parul Gupta(Knimbus) parulgp...@gmail.com wrote: Hi, Ok I will not use bold text in my queries. I guess my question is not clear to you. What I am doing is: I have a live source, say 'A', and a stored database, say 'B'. A and B both have title fields in them. Consider A as a non-persistent Solr and B as a persistent Solr. I have to match the titles coming from A to the database B. Some titles from live source A come in short form, e.g. 'med. phys.' and 'phys. fluids', but corresponding to these titles my database B has the titles 'medical physics' and 'physics of fluids'. Since this type of difference occurs, A is not able to find the corresponding titles in B using the tokenized field 'title' with wildcards, hence I used the Term component first, which gives me the corresponding matched title in B. When I get the full title like 'medical physics', I fetch it from HTML and then search it again in a tokenized copy field of 'title', say 'titlenew', which brings me the result 'medical physics'. But I am failing to match 'phys. fluids' with 'physics of fluids' as it has a stop word in it when using [a-z0-9]*. Hope now you will get my issue... and will help.. thanks.. -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200p4077628.html Sent from the Solr - User mailing list archive at Nabble.com.
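A sketch of the synonym approach Erick suggests, using only the two abbreviations from the thread; the field type name and file layout are assumptions. First the synonyms.txt entries:

```
med. phys., medical physics
phys. fluids, physics of fluids
```

Then a field type wired to apply them at index/query time:

```xml
<fieldType name="title_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expand="true" indexes both the abbreviation and the full form -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```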
solr autodetectparser tikaconfig dataimporter error
i am using solr 3.5, tika-app-1.4 and tagsoup 1.2.1. when i try to import a file via xml i get this error; it doesn't matter what file format i try to index, txt, cfm, pdf all give the same error: SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408) Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596) ...
6 more Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408) Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596) ...
6 more Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback data-config.xml:
<dataConfig>
  <dataSource type="BinURLDataSource" name="data"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main">
      <field column="title" xpath="//title" />
      <field column="id" xpath="//file" />
      <field column="contents" xpath="//description" />
      <field column="path" xpath="//path" />
      <field column="Author" xpath="//author" />
      <entity processor="TikaEntityProcessor" url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
        <field column="contents" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
the libs are included and declared in the logs; i have also tried tika-app 1.0 and tagsoup 1.2 with the same result. can someone please help, i don't know where to start looking for the error.
add to ContributorsGroup
Hi. Could you add me (KenGeis) to the Solr Wiki ContributorsGroup? I'd like to fix some typos. Thanks, Ken Geis
Re: add to ContributorsGroup
Done, Thanks for helping! Erick On Fri, Jul 12, 2013 at 4:30 PM, Ken Geis kg...@speakeasy.net wrote: Hi. Could you add me (KenGeis) to the Solr Wiki ContributorsGroup? I'd like to fix some typos. Thanks, Ken Geis
add to ContributorsGroup - Instructions for setting up SolrCloud on jboss
Hello, Can you please add me to the ContributorsGroup? I would like to add instructions for setting up SolrCloud using Jboss. thanks.
Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss
username: saqib On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib docbook@gmail.com wrote: Hello, Can you please add me to the ContributorsGroup? I would like to add instructions for setting up SolrCloud using Jboss. thanks.
Re: Norms
Norms stay in the index even if you delete all of the data. If you just changed the schema, emptied the index, and tested again, you've still got norms in there. You can examine the index with Luke to verify this. On 07/09/2013 08:57 PM, William Bell wrote: I have a field that has omitNorms=true, but when I look at debugQuery I see that the field is being normalized for the score. What can I do to turn off normalization in the score? I want a simple way to do 2 things: boost geodist() highest at 1 mile and lowest at 100 miles, plus add a boost for query=edgefield^5. I only want tf() and no queryNorm. I am not even sure I want idf(), but I can probably live with rare names being boosted. The results are being normalized. See below. I tried dismax and edismax - bf, bq and boost.
<requestHandler name="autoproviderdist" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <float name="tie">0.01</float>
    <str name="fl">display_name,city_state,prov_url,pwid,city_state_alternative</str>
    <!-- <str name="bq">_val_:sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)^10</str> -->
    <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
    <int name="rows">5</int>
    <str name="q.alt">*:*</str>
    <str name="qf">name_edgy^.9 name_edge^.9 name_word</str>
    <str name="group">true</str>
    <str name="group.field">pwid</str>
    <str name="group.main">true</str>
    <!-- <str name="pf">name_edgy</str> do not turn on -->
    <str name="sort">score desc, last_name asc</str>
    <str name="d">100</str>
    <str name="pt">39.740112,-104.984856</str>
    <str name="sfield">store_geohash</str>
    <str name="hl">false</str>
    <str name="hl.fl">name_edgy</str>
    <str name="mm">2-1 4-2 6-3</str>
  </lst>
</requestHandler>
0.058555886 = queryNorm product of: 10.854807 = (MATCH) sum of: 1.8391232 = (MATCH) max plus 0.01 times others of: 1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378), product of: 0.30982485 = queryWeight(name_edge:paul^0.9), product of: 0.9 = boost 5.8789964 = idf(docFreq=26567, maxDocs=3493655) 0.058555886 = queryNorm 5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378), product of: 1.0 = tf(termFreq(name_edge:paul)=1) 5.8789964 = idf(docFreq=26567, maxDocs=3493655) 1.0 = fieldNorm(field=name_edge, doc=231378) 1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378), product of: 0.30510724 = queryWeight(name_edgy:paul^0.9), product of: 0.9 = boost 5.789479 = idf(docFreq=29055, maxDocs=3493655) 0.058555886 = queryNorm 5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378), product of: 1.0 = tf(termFreq(name_edgy:paul)=1) 5.789479 = idf(docFreq=29055, maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy, doc=231378) 9.015684 = (MATCH) max plus 0.01 times others of: 8.9352665 = (MATCH) weight(name_word:nutting in 231378), product of: 0.72333425 = queryWeight(name_word:nutting), product of: 12.352887 = idf(docFreq=40, maxDocs=3493655) 0.058555886 = queryNorm 12.352887 = (MATCH) fieldWeight(name_word:nutting in 231378), product of: 1.0 = tf(termFreq(name_word:nutting)=1) 12.352887 = idf(docFreq=40, maxDocs=3493655) 1.0 = fieldNorm(field=name_word, doc=231378) 8.04174 = (MATCH) weight(name_edgy:nutting^0.9 in 231378), product of: 0.65100086 = queryWeight(name_edgy:nutting^0.9), product of: 0.9 = boost 12.352887 = idf(docFreq=40, maxDocs=3493655) 0.058555886 = queryNorm 12.352887 = (MATCH) fieldWeight(name_edgy:nutting in 231378), product of: 1.0 = tf(termFreq(name_edgy:nutting)=1) 12.352887 = idf(docFreq=40, maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy, doc=231378) 1.0855998 = sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
I am getting a java.lang.OutOfMemoryError: Requested array size exceeds VM limit on certain queries. Please advise: 19:25:02,632 INFO [org.apache.solr.core.SolrCore] (http-oktst1509.company.tld/12.5.105.96:8180-9) [collection1] webapp=/solr path=/select params={sort=sent_date+asc&distrib=false&wt=javabin&version=2&rows=2147483647&df=text&fl=id&shard.url=12.5.105.96:8180/solr/collection1/&NOW=1373675102627&start=0&q=thread_id:1439513570014188310&isShard=true&fq=domain:company.tld+AND+owner:11782344&fsv=true} hits=1 status=0 QTime=1 19:25:02,637 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-oktst1509.company.tld/12.5.105.96:8180-2) null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161) at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679) at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
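One detail worth noticing in the logged request is rows=2147483647: Lucene's top-N collectors size a priority queue to rows, so a rows value of Integer.MAX_VALUE asks for an array at the JVM's array-size limit before a single hit is returned. A rough back-of-envelope sketch (the per-entry byte count is an assumption, not a measured figure):

```java
public class RowsCostSketch {
    // Rough cost of asking Solr for rows=2147483647: top-N collectors size a
    // priority queue to `rows`, so the request tries to allocate backing
    // storage for that many entries up front.
    public static long queueBytes(long rows, long bytesPerEntry) {
        return rows * bytesPerEntry;
    }

    public static void main(String[] args) {
        long rows = 2147483647L;
        // Even at only 4 bytes per slot (one compressed object reference),
        // the backing array alone is ~8 GB, and the requested array length
        // itself sits right at the JVM's limit.
        System.out.println(queueBytes(rows, 4)); // 8589934588
    }
}
```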
zero-valued retrieval scores
when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
Re: zero-valued retrieval scores
Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
Re: zero-valued retrieval scores
Yes, you are right, the boost on these documents is 0. I didn't provide it, though. I suppose the boost scores come from Nutch (yes, my Solr indexes crawled web docs). What could be wrong? Again, what exactly is the formula for fieldNorm? On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote: Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
Re: zero-valued retrieval scores
For the calculation of norm, see note number 6: http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html You would need to talk to the Nutch guys to see why THEY are setting document boost to 0.0. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 11:57 PM To: solr-user@lucene.apache.org Subject: Re: zero-valued retrieval scores Yes, you are right, the boost on these documents is 0. I didn't provide it, though. I suppose the boost scores come from Nutch (yes, my Solr indexes crawled web docs). What could be wrong? Again, what exactly is the formula for fieldNorm? On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote: Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
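The note Jack points to boils down to norm = docBoost * fieldBoost * lengthNorm, where the classic (DefaultSimilarity) lengthNorm is 1/sqrt(number of terms in the field). A tiny sketch showing how a document boost of 0.0 zeroes fieldNorm; this is an illustrative re-implementation, not Lucene's actual code (Lucene also compresses the result into a single byte, losing precision):

```java
public class NormSketch {
    // Classic Lucene index-time norm, per note 6 of the TFIDFSimilarity
    // javadoc: norm = docBoost * fieldBoost * (1 / sqrt(numTermsInField)).
    public static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A document boost of 0.0 (as Nutch apparently supplied here) zeroes
        // the whole product, so fieldNorm and therefore the score are 0.
        System.out.println(norm(0.0f, 1.0f, 100)); // 0.0
        System.out.println(norm(1.0f, 1.0f, 100)); // 0.1
    }
}
```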
Re: zero-valued retrieval scores
Thanks, Jack! On Fri, Jul 12, 2013 at 9:37 PM, Jack Krupansky j...@basetechnology.com wrote: For the calculation of norm, see note number 6: http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html You would need to talk to the Nutch guys to see why THEY are setting document boost to 0.0. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 11:57 PM To: solr-user@lucene.apache.org Subject: Re: zero-valued retrieval scores Yes, you are right, the boost on these documents is 0. I didn't provide it, though. I suppose the boost scores come from Nutch (yes, my Solr indexes crawled web docs). What could be wrong? Again, what exactly is the formula for fieldNorm? On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote: Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!