Re: Solr performance issue

2011-03-14 Thread Jonathan Rochkind
I've definitely had cases in 1.4.1 where, even though I didn't have an 
OOM error, Solr was being weirdly slow, and increasing the JVM heap size 
fixed it.  I can't explain why it happened, or exactly how you'd know 
this was going on; I didn't see anything odd in the logs to indicate it. 
I just tried increasing the JVM heap to see what happened, and it worked 
great.


The one case I remember specifically is when I was using the 
StatsComponent with a stats.facet.  It was pathologically slow, and 
increasing the heap magically made it drop back to negligible again.


On 3/14/2011 3:38 PM, Markus Jelsma wrote:

Hello,

2011/3/14 Markus Jelsma markus.jel...@openindex.io


Hi Doğacan,

Are you, at some point, running out of heap space? In my experience,
that's the common cause of increased load and excessively high response
times (or timeouts).

How much heap would be enough? Our index size is growing slowly,
but we did not have this problem a couple of weeks ago, when the index
was maybe 100MB smaller.

How much heap space is needed isn't easy to say. It usually needs to
be increased when you run out of memory and get those nasty OOM errors; are
you getting them?
Replication events will increase heap usage due to cache warming queries and
autowarming.


We left most of the caches in solrconfig as default and only increased
filterCache to 1024. We only ask for ids (which are unique) and no other
fields during queries (though we do faceting). Btw, 1.6GB of our index is
stored fields (we store everything for now, even though we do not fetch
them during queries), and about 1GB is the index itself.
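
For reference, a filterCache sized like that looks roughly like this in
solrconfig.xml (a sketch; the autowarmCount value here is an assumption, not
the poster's actual setting):

<filterCache class="solr.FastLRUCache"
             size="1024"
             initialSize="1024"
             autowarmCount="128"/>

The autowarmCount is what drives the warming work (and the extra heap use)
right after each replication-triggered commit.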

Hmm, it seems 4000 would be enough indeed. What about the fieldCache, are there
a lot of entries? Is there an insanity count? Do you use boost functions?

It might not have anything to do with memory at all, but I'm just asking. There
may be a bug in your revision causing this.


Anyway, Xmx was 4000m; we tried increasing it to 8000m but did not get any
improvement in load. I can try monitoring with JConsole
with 8GB of heap to see if it helps.


Cheers,


Hello everyone,

First of all here is our Solr setup:

- Solr nightly build 986158
- Running Solr inside the default Jetty that comes with the Solr build
- 1 write-only master, 4 read-only slaves (quad core 5640 with 24GB of
RAM) - Index replicated (on optimize) to slaves via Solr Replication
- Size of index is around 2.5GB
- No incremental writes; index is created from scratch (delete old
documents -> commit new documents -> optimize) every 6 hours
- Avg # of requests per second is around 60 (for a single slave)
- Avg time per request is around 25ms (before having problems)
- Load on each slave is around 2

We have been using this setup for months without any problem. However, last
week we started to experience very weird performance problems, like:

- Avg time per request increased from 25ms to 200-300ms (even higher if
we don't restart the slaves)
- Load on each slave increased from 2 to 15-20 (Solr uses 400%-600%
CPU)

When we profile Solr we see two very strange things:

1 - This is the jconsole output:

https://skitch.com/meralan/rwwcf/mail-886x691

As you can see, GC runs every 10-15 seconds and collects more than 1GB
of memory. (Actually, if you wait more than 10 minutes you see spikes up to
4GB consistently.)

2 - This is the New Relic output:

https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm

As you can see, Solr spends a ridiculously long time in the
SolrDispatchFilter.doFilter() method.


Apart from these, when we clean the index directory, re-replicate and
restart each slave one by one, we see some relief in the system, but after
some time the servers start to melt down again. Although deleting the index
and re-replicating doesn't solve the problem, we think these problems are
somehow related to replication, because the symptoms started after a
replication and the system once healed itself after a replication. I also
see lucene-write.lock files on the slaves (we don't have write.lock files
on the master), which I think we shouldn't see.


If anyone can offer any ideas, we would appreciate it.

Regards,
Dogacan Guney


Re: Solr performance issue

2011-03-14 Thread Jonathan Rochkind
It's actually, as I understand it, expected JVM behavior to see the heap 
rise to close to its limit before it gets GC'd; that's how Java GC 
works.  Whether that should happen every 20 seconds or what, I don't know.


Another option is setting better JVM garbage collection arguments, so GC 
doesn't stop the world so often. I have had good luck with my Solr 
using this:  -XX:+UseParallelGC
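
For example, a Jetty start line with GC flags along those lines might look
like this (a sketch, not my exact settings; the heap sizes are placeholders):

java -Xms4g -Xmx4g -XX:+UseParallelGC -XX:+UseParallelOldGC -jar start.jar

The concurrent collector (-XX:+UseConcMarkSweepGC) is the other common choice
if you care more about pause times than raw throughput.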






On 3/14/2011 4:15 PM, Doğacan Güney wrote:

Hello again,

2011/3/14 Markus Jelsma markus.jel...@openindex.io


Hello,

2011/3/14 Markus Jelsma markus.jel...@openindex.io


Hi Doğacan,

Are you, at some point, running out of heap space? In my experience,
that's the common cause of increased load and excessively high response
times (or timeouts).

How much heap would be enough? Our index size is growing slowly,
but we did not have this problem a couple of weeks ago, when the index
was maybe 100MB smaller.

How much heap space is needed isn't easy to say. It usually needs to
be increased when you run out of memory and get those nasty OOM errors; are
you getting them?
Replication events will increase heap usage due to cache warming queries
and autowarming.



Nope, no OOM errors.



We left most of the caches in solrconfig as default and only increased
filterCache to 1024. We only ask for ids (which are unique) and no other
fields during queries (though we do faceting). Btw, 1.6GB of our index is
stored fields (we store everything for now, even though we do not fetch
them during queries), and about 1GB is the index itself.

Hmm, it seems 4000 would be enough indeed. What about the fieldCache, are
there
a lot of entries? Is there an insanity count? Do you use boost functions?



Insanity count is 0 and fieldCache has 12 entries. We do use some boosting
functions.

Btw, I am monitoring output via jconsole with 8GB of RAM and it still climbs
to 8GB every 20 seconds or so, GC runs, and it falls back down to 1GB.

Btw, our current revision was just a random choice, but up until two weeks
ago it had been rock-solid, so we have been
reluctant to update to another version. Would you recommend upgrading to
the latest trunk?



It might not have anything to do with memory at all, but I'm just asking.
There may be a bug in your revision causing this.


Anyway, Xmx was 4000m; we tried increasing it to 8000m but did not get
any improvement in load. I can try monitoring with JConsole
with 8GB of heap to see if it helps.


Cheers,


Hello everyone,

First of all here is our Solr setup:

- Solr nightly build 986158
- Running Solr inside the default Jetty that comes with the Solr build
- 1 write-only master, 4 read-only slaves (quad core 5640 with 24GB of
RAM) - Index replicated (on optimize) to slaves via Solr Replication
- Size of index is around 2.5GB
- No incremental writes; index is created from scratch (delete old
documents -> commit new documents -> optimize) every 6 hours
- Avg # of requests per second is around 60 (for a single slave)
- Avg time per request is around 25ms (before having problems)
- Load on each slave is around 2

We have been using this setup for months without any problem. However, last
week we started to experience very weird performance problems, like:

- Avg time per request increased from 25ms to 200-300ms (even higher if
we don't restart the slaves)
- Load on each slave increased from 2 to 15-20 (Solr uses 400%-600%
CPU)

When we profile Solr we see two very strange things:

1 - This is the jconsole output:

https://skitch.com/meralan/rwwcf/mail-886x691

As you can see, GC runs every 10-15 seconds and collects more than 1GB
of memory. (Actually, if you wait more than 10 minutes you see spikes up to
4GB consistently.)

2 - This is the New Relic output:

https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm

As you can see, Solr spends a ridiculously long time in the
SolrDispatchFilter.doFilter() method.


Apart from these, when we clean the index directory, re-replicate and
restart each slave one by one, we see some relief in the system, but after
some time the servers start to melt down again. Although deleting the index
and re-replicating doesn't solve the problem, we think these problems are
somehow related to replication, because the symptoms started after a
replication and the system once healed itself after a replication. I also
see lucene-write.lock files on the slaves (we don't have write.lock files
on the master), which I think we shouldn't see.


If anyone can offer any ideas, we would appreciate it.

Regards,
Dogacan Guney





Re: disquery - difference qf qs / pf ps

2011-03-10 Thread Jonathan Rochkind

On 3/10/2011 8:15 AM, Gastone Penzo wrote:

Thank you very much. I understand the difference between qs and ps but not
what pf is... is it necessary to use ps?


It's not necessary to use anything, including Solr.

pf:  Will take the entire query the user entered, make it into a single 
phrase, and boost documents within the already existing result set that 
match that phrase. pf does not change the result set, it just changes 
the ranking.
ps: Will set the phrase query slop on that pf query of the entire entered 
search string, which affects the boosting.
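
For example, a dismax request using both might look like this (a sketch;
field names and boosts are made up):

q=solr performance&defType=dismax&qf=title^2 body&pf=title^5 body^2&ps=3&qs=1

Here pf boosts documents where "solr performance" occurs as a (sloppy, ps=3)
phrase in title or body, while qs only loosens explicit phrase queries the
user typed.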





Re: True master-master fail-over without data gaps

2011-03-09 Thread Jonathan Rochkind

On 3/9/2011 12:05 PM, Otis Gospodnetic wrote:
But check this! In some cases one is not allowed to save content to disk (think
copyrights).  I'm not making this up - we actually have a customer with this
"cannot save to disk (but can index)" requirement.


Do they realize that a Solr index is on disk, and if you save it to a 
Solr index it's being saved to disk?  If they prohibited you from 
putting the doc in a stored field in Solr, I guess that would at least 
be somewhat consistent, although annoying.


But I don't think it's our customers' job to tell us HOW to implement 
our software to get the results they want. They can certainly make you 
promise not to distribute or use copyrighted material, and they can even 
ask to see your security procedures to make sure it doesn't get out.  
But if you need to buffer documents to achieve the application they 
want, and they won't let you... Solr can't help you with that.


As I suggested before though, I might rather buffer to a NoSQL store 
like MongoDB or CouchDB instead of actually to disk. Perhaps your 
customer won't notice those stores keep data on disk just like they 
haven't noticed Solr does.  I am not an expert in various kinds of NoSQL 
stores, but I think some of them in fact specialize in the area of 
concern here: Absolute failover reliability through replication.


Solr is not a store.


So buffering to disk is not an option, and buffering in memory is not practical
because of the input document rate and their size.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent:  Tuesday, March 08, 2011 11:45 PM
To: solr-user@lucene.apache.org
Subject:  True master-master fail-over without data gaps

Hello,

What are some common or good ways to handle indexing (master) fail-over?
Imagine you have a continuous stream of incoming documents that you have to
index without losing any of them (or with losing as few of them as possible).
How do you set up your masters?
In other words, you can't just have 2 masters where the secondary is the
Repeater (or Slave) of the primary master and replicates the index
periodically: you need to have 2 masters that are in sync at all times!
How do you achieve that?

* Do you just put N masters behind a LB VIP, configure them both to point to
the index on some shared storage (e.g. SAN), and count on the LB to fail-over
to the secondary master when the primary becomes unreachable?
If so, how do you deal with index locks? You use the Native lock and count on
it disappearing when the primary master goes down? That means you count on the
whole JVM process dying, which may not be the case...

* Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters
with 2 separate indices in sync, while making sure you write to only 1 of them
via LB VIP or otherwise?

* Or  ...


This thread is on a similar topic, but is inconclusive:
   http://search-lucene.com/m/aOsyN15f1qd1

Here is another similar  thread, but this one doesn't cover how 2 masters are
kept in sync at all  times:
   http://search-lucene.com/m/aOsyN15f1qd1

Thanks,
Otis

Sematext  :: http://sematext.com/ ::  Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Excluding results from more like this

2011-03-09 Thread Jonathan Rochkind
Yeah, that just restricts what items are in your main result set (and 
adding -4 has no real effect).


The more like this set is constructed based on your main result set, for 
each document in it.


As far as I can see from here: http://wiki.apache.org/solr/MoreLikeThis

...there seems to be no built-in way to customize the 'more like this' 
results in the way you want, excluding certain document ids.  I don't 
entirely understand what mlt.boost does, but I don't think it does 
anything useful for this case.


So, if that's so,  you are out of luck, unless you want to write Java 
code. In which case you could try customizing or adding that feature to 
the MoreLikeThis search component, and either suggest your new code back 
as a patch, or just use your own customized version of MoreLikeThis.
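
Short of that, a pragmatic workaround is to fetch the MLT results as usual
and drop the unwanted ids on the client side. A rough SolrJ sketch (the URL,
field names and the excluded id are placeholders, not from this thread):

import java.util.Iterator;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.util.NamedList;

public class MltFilter {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("id:(2 3 5)");
    q.set("mlt", true);
    q.set("mlt.fl", "description,id");
    q.set("fl", "*,score");

    QueryResponse rsp = server.query(q);

    // The MLT component returns one SolrDocumentList per matched document,
    // keyed by that document's unique key, under "moreLikeThis".
    NamedList<?> mlt = (NamedList<?>) rsp.getResponse().get("moreLikeThis");
    if (mlt != null) {
      for (int i = 0; i < mlt.size(); i++) {
        SolrDocumentList similar = (SolrDocumentList) mlt.getVal(i);
        Iterator<SolrDocument> it = similar.iterator();
        while (it.hasNext()) {
          SolrDocument d = it.next();
          if ("4".equals(String.valueOf(d.getFieldValue("id")))) {
            it.remove(); // drop the excluded record before rendering
          }
        }
      }
    }
  }
}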


On 3/9/2011 4:29 PM, Brian Lamb wrote:

That doesn't seem to do it. Record 4 is still showing up in the MoreLikeThis
results.

On Wed, Mar 9, 2011 at 4:12 PM, Otis Gospodnetic otis_gospodne...@yahoo.com

wrote:
Brian,

...?q=id:(2  3 5) -4


Otis
---
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 

From: Brian Lamb brian.l...@journalexperts.com
To: solr-user@lucene.apache.org
Sent: Wed, March 9, 2011 4:05:10 PM
Subject: Excluding results from more like this

Hi all,

I'm using MoreLikeThis to find similar results but I'd like to  exclude
records by the id number. For example, I use the following  URL:

http://localhost:8983/solr/search/?q=id:(2 3
5)&mlt=true&mlt.fl=description,id&fl=*,score

How would I exclude record 4 from the MoreLikeThis results?

I tried,

http://localhost:8983/solr/search/?q=id:(2 3
5)&mlt=true&mlt.fl=description,id&fl=*,score&mlt.q=!4

But that still returned record 4 in the MoreLikeThis results.



Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Jonathan Rochkind
Yes, but the identical index with the identical solrconfig.xml and the 
identical query and the identical version of Solr on two different 
machines should produce identical results.


So it's a legitimate question why it's not.  But perhaps queryNorm isn't 
enough to answer that. Sorry, it's out of my league to try and figure 
it out.


But are you absolutely sure you have identical indexes, identical 
solrconfig.xml, identical queries, and identical versions of Solr and 
any other installed Java libraries... on both machines?  One of these 
being different seems more likely than a bug in Solr, although that's 
possible.


On 3/9/2011 4:34 PM, Jayendra Patil wrote:

queryNorm is just a normalizing factor and is the same value across
all the results for a query, to make the scores comparable.
So even if it varies between environments, you should not be worried about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
-
Definition - queryNorm(q) is just a normalizing factor used to make
scores between queries comparable. This factor does not affect
document ranking (since all ranked documents are multiplied by the
same factor), but rather just attempts to make scores from different
queries (or even different indexes) comparable
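
For reference, with the default Similarity that factor is computed roughly as
(a sketch of the standard Lucene formula, nothing specific to this index):

  queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
  sumOfSquaredWeights ~= sum over query terms t of (idf(t) * boost(t))^2

so a different idf (different docFreq/maxDocs between the two indexes) shifts
queryNorm even though, within one index, it does not change the ranking.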

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk  wrote:

Hi,

I am seeing an issue I do not understand and hope that someone can shed some 
light on this. The issue is that for a particular search we are seeing a 
particular result rank in position 3 on one machine and position 8 on the 
production machine. The position 3 is our desired and roughly expected ranking.

I have a local machine with solr and a version deployed on a production server. 
My local machine's solr and the production version are both checked out from 
our project's SVN trunk. They are identical files except for the data files 
(not in SVN) and database connection settings.

The index is populated exclusively via data import handler queries to a 
database.

I have exported the production database as-is to my local development machine 
so that my local machine and production have access to the self same data.

I execute a total full-import on both.

Still, I see a different position for this document that should surely rank in 
the same location, all else being equal.

I ran debugQuery diff to see how the scores were being computed. See appendix 
at foot of this email.

As far as I can tell every single query normalisation block of the debug is 
marginally different, e.g.

-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

Which leads to a final score of -2 versus +1 which is enough to skew the 
results from correct to incorrect (in terms of what we expect to see).

- -2.286596 (local)
+1.0651637 = (production)

I cannot explain this difference. The database is the same. The configuration 
is the same. I have fully imported from scratch on both servers. What am I 
missing?

Thank you for your time

Allistair

- snip

APPENDIX - debugQuery=on DIFF

--- untitled
+++ (clipboard)
@@ -1,51 +1,49 @@
-str name=L12411p
+str name=L12411

-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-1.3198489 = (MATCH) max plus 0.01 times others of:
-  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-0.011795795 = queryWeight(text:dubai^0.1), product of:
-  0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+0.6151879 = (MATCH) max plus 0.01 times others of:
+  0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+0.05489459 = queryWeight(text:dubai), product of:
   5.520305 = idf(docFreq=65, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
   1.4142135 = tf(termFreq(text:dubai)=2)
   5.520305 = idf(docFreq=65, maxDocs=6063)
   0.25 = fieldNorm(field=text, doc=1551)
-  1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
-0.32609802 = queryWeight(profile:dubai^2.0), product of:
+  0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
+0.15175761 = queryWeight(profile:dubai^2.0), product of:
   2.0 = boost
   7.6305184 = idf(docFreq=7, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 4.0466933 = (MATCH) fieldWeight(profile:dubai in 1551), product of:
   1.4142135 = tf(termFreq(profile:dubai)=2)
   7.6305184 = idf(docFreq=7, maxDocs=6063)
   0.375 = fieldNorm(field=profile, doc=1551)
-0.36931866 = (MATCH) max plus 0.01 times others of:
-  0.0018293816 = (MATCH) weight(text:product^0.1 in 1551), product of:
-0.003954251 = queryWeight(text:product^0.1), product of:
-  0.1 = boost
+0.17194802 = (MATCH) max 

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Jonathan Rochkind
Wait, if you don't have identical indexes, then why would you expect 
identical results?


If your indexes are different, one would expect the results for the same 
query to be different -- there are different documents in the index!   
The IDF portion of the TF/IDF type algorithm at the base of Solr's 
relevancy will also be different in different indexes. 
http://en.wikipedia.org/wiki/Tf%E2%80%93idf


Maybe I'm misunderstanding you.  But if you have different indexes -- 
not exactly the same collection of documents indexed using exactly the 
same field definitions and rules -- then one should expect different 
relevance results.


Jonathan

On 3/9/2011 4:48 PM, Allistair Crossley wrote:

That's what I think, glad I am not going mad.

I've spent 1/2 a day comparing the config files, checking out from SVN again 
and ensuring the databases are identical. I cannot see what else I can do to 
make them equivalent. Both servers check out directly from SVN; I am convinced 
the files are the same. The database is definitely the same.

Not sure what you mean about having identical indices - that's my problem - I 
don't - or do you mean something else I've missed? But yes everything else you 
mention is identical, I am as certain as I can be.

I too think there must be a difference I have missed but I have run out of 
ideas for what to check!

Frustrating :)

On Mar 9, 2011, at 4:38 PM, Jonathan Rochkind wrote:


Yes, but the identical index with the identical solrconfig.xml and the 
identical query and the identical version of Solr on two different machines 
should produce identical results.

So it's a legitimate question why it's not.  But perhaps queryNorm isn't enough 
to answer that. Sorry, it's out of my league to try and figure it out.

But are you absolutely sure you have identical indexes, identical 
solrconfig.xml, identical queries, and identical versions of Solr and any other 
installed Java libraries... on both machines?  One of these being different 
seems more likely than a bug in Solr, although that's possible.

On 3/9/2011 4:34 PM, Jayendra Patil wrote:

queryNorm is just a normalizing factor and is the same value across
all the results for a query, to make the scores comparable.
So even if it varies between environments, you should not be worried about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
-
Definition - queryNorm(q) is just a normalizing factor used to make
scores between queries comparable. This factor does not affect
document ranking (since all ranked documents are multiplied by the
same factor), but rather just attempts to make scores from different
queries (or even different indexes) comparable

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk   wrote:

Hi,

I am seeing an issue I do not understand and hope that someone can shed some 
light on this. The issue is that for a particular search we are seeing a 
particular result rank in position 3 on one machine and position 8 on the 
production machine. The position 3 is our desired and roughly expected ranking.

I have a local machine with solr and a version deployed on a production server. 
My local machine's solr and the production version are both checked out from 
our project's SVN trunk. They are identical files except for the data files 
(not in SVN) and database connection settings.

The index is populated exclusively via data import handler queries to a 
database.

I have exported the production database as-is to my local development machine 
so that my local machine and production have access to the self same data.

I execute a total full-import on both.

Still, I see a different position for this document that should surely rank in 
the same location, all else being equal.

I ran debugQuery diff to see how the scores were being computed. See appendix 
at foot of this email.

As far as I can tell every single query normalisation block of the debug is 
marginally different, e.g.

-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

Which leads to a final score of -2 versus +1 which is enough to skew the 
results from correct to incorrect (in terms of what we expect to see).

- -2.286596 (local)
+1.0651637 = (production)

I cannot explain this difference. The database is the same. The configuration 
is the same. I have fully imported from scratch on both servers. What am I 
missing?

Thank you for your time

Allistair

- snip

APPENDIX - debugQuery=on DIFF

--- untitled
+++ (clipboard)
@@ -1,51 +1,49 @@
-str name=L12411p
+str name=L12411

-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-1.3198489 = (MATCH) max plus 0.01 times others of:
-  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-0.011795795 = queryWeight(text:dubai^0.1), product of:
-  0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+0.6151879 = (MATCH

Re: NRT in Solr

2011-03-09 Thread Jonathan Rochkind
Interesting, does anyone have a summary of what techniques zoie uses to 
do this?  I don't see any docs on the technical details.


On 3/9/2011 5:29 PM, Smiley, David W. wrote:

Zoie adds NRT to Solr:
http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin

I haven't tried it yet but looks cool.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote:


Jae,

NRT hasn't been implemented in Solr as of yet, I think partially
because major features such as replication, caching, and uninverted
faceting suddenly are no longer viable, eg, it's another round of
testing etc.  It's doable; however, I think the best approach is a
separate request call path, to avoid altering the current [working]
API.

On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo jaejo...@gmail.com  wrote:

Hi,
Is NRT in Solr 4.0 from trunk? I have checked it out from trunk, but could not
find the configuration for NRT.

Regards

Jae









Re: Solr Hanging all of sudden with update/csv

2011-03-08 Thread Jonathan Rochkind
My guess is that you're running out of RAM.  Actual Java profiling is 
beyond me, but I have seen issues on updating that were solved by more RAM.


If you are updating every few minutes, and your new index takes more 
than a few minutes to warm, you could be running into overlapping 
warming searcher issues. Some more info on what I mean by this is in this 
FAQ, although the FAQ isn't actually targeted at this case exactly: 
http://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F


Overlapping warming searchers can result in excessive RAM and/or CPU usage.

If you haven't already given your JVM options to tune garbage collection, 
doing so can also help, e.g. using the concurrent collector options.  But if 
your fundamental problem is overlapping warming searchers, you probably need 
to make that stop.
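
In solrconfig.xml the relevant knobs look roughly like this (a sketch; 2 is
the usual default for maxWarmingSearchers, and the autoCommit numbers are
illustrative, not your config):

<maxWarmingSearchers>2</maxWarmingSearchers>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>600000</maxTime> <!-- ms: commit at most every 10 minutes -->
  </autoCommit>
</updateHandler>

If commits arrive faster than a new searcher can warm, you blow past that
limit and pile up memory and CPU.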


On 3/8/2011 5:17 PM, danomano wrote:

Hi folks, I've been using solr for about 3 months.

Our Solr install is a single node, and we have been injecting logging data
into the Solr server every couple of minutes, with each update taking a few
minutes.

Everything was working fine until this morning, at which point it appeared
that all updates were hung.

Restarting the Solr server did not help, as all updaters immediately 'hung'
again.

Poking around in the threads, and strace, I do in fact see stuff happening.

The index size itself is about 270GB (we are hoping to support up to
500GB-1TB), and we have supplied the system with ~3TB of disk space.

Any tips on what could be happening?
notes: we have never run an optimize yet.
   we have never deleted from the system yet.


The merge thread appears to be the one 'never returning':
Lucene Merge Thread #0 - Thread t@41
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcher.pread0(Native Method)
at sun.nio.ch.FileDispatcher.pread(FileDispatcher.java:31)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
at sun.nio.ch.IOUtil.read(IOUtil.java:210)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:622)
at
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:161)
at
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:139)
at
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:94)
at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:176)
at
org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:209)
at
org.apache.lucene.index.SegmentMerger.copyFieldsNoDeletions(SegmentMerger.java:424)
at
org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:332)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4053)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3645)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:339)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:407)


Some ptrace output:
23178 pread(172,
\270\316\276\2\245\371\274\2\271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2...,
4096, 98004192) = 40960.09
23178 pread(172,
\245\371\274\2\271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2...,
4096, 98004196) = 40960.09
23178 pread(172,
\271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2...,
4096, 98004200) = 40960.08
23178 pread(172,
\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2...,
4096, 98004204) = 40960.08
23178 pread(172,
\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2...,
4096, 98004208) = 40960.08
23178 pread(172,
\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2...,
4096, 98004212) = 40960.09
23178 pread(172,
\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2...,
4096, 98004216) = 40960.08
23178 pread(172,
\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2...,
4096, 98004220) = 40960.09
23178 pread(172,
\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2\304\316\276\2...,
4096, 98004224) = 40960.13
22688... futex resumed  ) = -1 ETIMEDOUT (Connection timed
out)0.051276
23178 pread(172,
\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2\304\316\276\2\305\316\276\2...,
4096, 98004228) = 40960.10
22688 futex(0x464a9f28, FUTEX_WAKE_PRIVATE, 1
23178 pread(172,

RE: True master-master fail-over without data gaps

2011-03-08 Thread Jonathan Rochkind
I'd honestly think about buffering the incoming documents in some store that's 
actually made for fail-over persistence reliability, maybe CouchDB or 
something. Then that takes care of not losing anything, and the problem becomes 
how to make sure our Solr master indexes are kept in sync with the actual 
persistent store; I'm still not sure how to do that, but I think it's a simpler 
problem. The right tool for the right job: that kind of failover persistence 
is not Solr's specialty. 
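
A minimal sketch of that shape, assuming some durable queue/store API (the
DocStore interface here is hypothetical; only the SolrJ calls are real):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical durable buffer: documents stay in it until acknowledged.
interface DocStore {
  SolrInputDocument take() throws InterruptedException; // blocks for next doc
  void ack(SolrInputDocument doc);                      // remove once indexed
  void requeue(SolrInputDocument doc);                  // put back on failure
}

class BufferedIndexer implements Runnable {
  private final DocStore store;
  private final SolrServer master;

  BufferedIndexer(DocStore store, String masterUrl) throws Exception {
    this.store = store;
    this.master = new CommonsHttpSolrServer(masterUrl);
  }

  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      SolrInputDocument doc = null;
      try {
        doc = store.take();
        master.add(doc);   // rely on the (auto)commit policy on the master
        store.ack(doc);    // only drop from the buffer after Solr accepted it
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } catch (Exception e) {
        if (doc != null) store.requeue(doc); // master down: keep doc buffered
      }
    }
  }
}

The DocStore role is the piece CouchDB/MongoDB (or a message queue) would play.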

From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent: Tuesday, March 08, 2011 11:45 PM
To: solr-user@lucene.apache.org
Subject: True master-master fail-over without data gaps

Hello,

What are some common or good ways to handle indexing (master) fail-over?
Imagine you have a continuous stream of incoming documents that you have to
index without losing any of them (or with losing as few of them as possible).
How do you set up your masters?
In other words, you can't just have 2 masters where the secondary is the
Repeater (or Slave) of the primary master and replicates the index periodically:
you need to have 2 masters that are in sync at all times!
How do you achieve that?

* Do you just put N masters behind a LB VIP, configure them both to point to the
index on some shared storage (e.g. SAN), and count on the LB to fail-over to the
secondary master when the primary becomes unreachable?
If so, how do you deal with index locks?  You use the Native lock and count on
it disappearing when the primary master goes down?  That means you count on the
whole JVM process dying, which may not be the case...

* Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters
with 2 separate indices in sync, while making sure you write to only 1 of them
via LB VIP or otherwise?

* Or ...


This thread is on a similar topic, but is inconclusive:
  http://search-lucene.com/m/aOsyN15f1qd1

Here is another similar thread, but this one doesn't cover how 2 masters are
kept in sync at all times:
  http://search-lucene.com/m/aOsyN15f1qd1

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: dismax, and too much qf?

2011-03-07 Thread Jonathan Rochkind
I use about that many qf's in Solr 1.4.1.   It works. I'm not entirely 
sure if it has performance implications -- I do have searching that is 
somewhat slower than I'd like, but I'm not sure if the lengthy qf is a 
contributing factor, or other things I'm doing (like a dozen different 
facet.fields too!).   I haven't profiled everything.  But it doesn't 
grind my Solr to a halt or anything; it works.


Separately, I've also been thinking of other ways to get similar 
highlighting behavior to what you describe -- getting back the 'field' the 
match was in in the highlight response -- but I haven't come up with anything 
great, so if your approach works, that's cool.  I've been trying to think 
of a way to store a single stored field in a structured format (CSV? 
XML?), and somehow have the highlighter return the complete 'field' that 
matches, not just the surrounding X words. But I haven't gotten anywhere 
on that, just an idle thought.


Jonathan

On 3/4/2011 10:09 AM, Jeff Schmidt wrote:

Hello:

I'm working on implementing a requirement where when a document is returned, we want to 
pithily tell the end user why. That is, say, with five documents returned, they may be so 
for similar or different reasons. These reasons are the field(s) in which 
matches occurred.  Some are more important than others, and I'll have to return just the 
most relevant one or two reasons to not overwhelm the user.

This is a separate goal than Solr's scoring of the returned documents. That is, 
index/query time boosting can indicate which fields are more significant in computing the 
overall document score, but then I need to know what fields where, matched with what 
terms. I do have an application that stands between Solr and the end user (RESTful API), 
so I figured I can rank the reasons and return more domain specific names 
rather than the Solr fields names.

So, I've turned to highlighting, and in the results I can see for each document ID 
the fields matched, the text in the field, etc. Great. But, to get that to work, 
I have to query individual fields explicitly. That is, the approach 
of copyField'ing a bunch of fields to a common text field for efficiency 
purposes is no longer an option. And, using the dismax request handler, I'm querying 
a lot of fields:

  <str name="qf">
 n_nameExact^4.0
 n_macromolecule_nameExact^3.0
 n_macromolecule_name^2.0
 n_macromolecule_id^1.8
 n_pathway_nameExact^1.5
 n_top_regulates
 n_top_regulated_by
 n_top_binds
 n_top_role_in_cell
 n_top_disease
 n_molecular_function
 n_protein_family
 n_subcell_location
 n_pathway_name
 n_cell_component
 n_bio_process
 n_synonym^0.5
 n_macromolecule_summary^0.6
 p_nameExact^4.0
 p_name^2.0
 p_description^0.6
  </str>

Is that crazy?  Is telling Solr to look at so many individual fields going to 
be a performance problem?  I'm only prototyping at this stage and it works 
great. :)  I've not run anything yet at scale handling lots of requests.

There are two document types in that shared index, demarcated using a field 
named type.  So, when configuring the SolrJ SolrQuery, I do setup 
addFilterQuery() to select one or the other type.

Anyway, using dismax with all of those query fields along with highlighting, I 
get the information I need to render meaningful results for the end user.  But, 
it has a sort of smell to it. :)   Shall I look for another way, or am I 
worrying about nothing?

I am currently using Solr 3.1 trunk.

Thanks!

Jeff
--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com




RE: Full Text Search with multiple index and complex requirements

2011-03-06 Thread Jonathan Rochkind
While it might be possible to work things out, not just one but several of your 
requirements are things that are difficult for Solr to do, or that Solr isn't 
really optimized to do. Are you sure you need an inverted-index tool like 
Solr at all, as opposed to some kind of store (RDBMS or NoSQL), for all or some 
parts of your data?  

From: Shrinath M [shrinat...@webyog.com]
Sent: Sunday, March 06, 2011 11:49 PM
To: rajini maski
Cc: solr-user@lucene.apache.org
Subject: Re: Full Text Search with multiple index and complex requirements

On Mon, Mar 7, 2011 at 9:56 AM, rajini maski rajinima...@gmail.com wrote:

 I just tried to answer your many questions, as I like this style of questions.
 Answers attached to the questions below.

 Thank you Rajini, for your interest :)


 A) The data for every user is totally unrelated to every other user. This
 gives us a few advantages:

   1. we can keep our indexes small in size.
  (using cores)
   2. merging/compacting a fragmented index will take less time.
  (merging is simple, one query)
   3. if some indexes become inaccessible for whatever reason
   (corruption?), only those users get affected. Other users are unaffected
   and the service is available for them.
 Yes, it affects only that index; others are unaffected.


How many cores can we safely have on a machine ? How much is too much in
this case ?


 B) Each user can have few different types of data.

 So, our index hierarchy will look something like:
 /user1/type1/index files
 /user1/type2/index files
 /user2/type1/index files
 /user3/type3/index files

 I am not clear on the point here.
 Example: say you have 2 users
 user1
  types - Name, Email address, Phone number
 user2
  types - Name, Email address, ID
 So you want to have 3 indexes for user1 plus 3 for user2, total = 6 indexes?
 If user1's phone number is the only number type in the data index, then the
 schema will have only one numeric data type.



I just meant to say, like this :

/myself/docs/index_docs
/myself/spreadsheets/index_spreads
/yourself/docs/index_docs
/yourself/spreadsheets/index_spreads

You get the idea, right?
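
In Solr multicore terms that layout maps to something roughly like this in
solr.xml (a sketch; core names and paths just mirror the example above):

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="myself_docs"           instanceDir="users/myself/docs" />
    <core name="myself_spreadsheets"   instanceDir="users/myself/spreadsheets" />
    <core name="yourself_docs"         instanceDir="users/yourself/docs" />
    <core name="yourself_spreadsheets" instanceDir="users/yourself/spreadsheets" />
  </cores>
</solr>

New cores can also be created at runtime through the CoreAdmin CREATE command,
which is how you would add a new user/type without hand-editing solr.xml.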

C) Often, probably with every iteration, we'll add types of data that can
 be indexed.
 So we want to have an efficient/programmatic way to add schemas for
 different types. We would like to avoid having a fixed schema for indexing.

 Say you added a type, DATE.
 Before you start indexing for this date type, you need to update your
 schema with this data type to enable indexing... correct?
 So this doesn't need a fixed schema defined beforehand; we can add it only when
 you want to add this data type. But this requires a service restart.
 This won't affect the current index, other than adding to it.


Today I am adding only docs and spreadsheets, tomorrow I may want to add
something else, something from RDBMS for example, then I don't want
to sit tinkering with schema.xml and I wouldn't like a service restart
either...



 D) The users can fire search queries which will search either: - Within a
 specific type for that user - Across all types for that user: in this
 case
 we want to fire a parallel query like Lucene has.
 (ParallelMultiSearcher
 http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html
 
 )


 Sharding in Solr works like this:
 You have phone number details in one index and again phone number details
 only in the other index too.
 You can search across both indexes, firing a query as Ph: across index1
 and index2.
 You cannot fire one search query as Name:xyz and Ph: across index
 one and index2 when index one has a data type defined only for name and
 index2 has one only for phone number. This can only be done if you define in
 the schema the data types for both (this creates the problem of having the
 same/fixed schema).


 E) We require real-time updates for the index. *This is a must.*
 This can be possible. Indexing must be enabled every minute or so:
 check if updates were made; if so, re-index and maintain uniqueness with the
 userid.



 We were considering Lucene, Sphinx and Solr to do this. This is what we
 found:

   - Sphinx: No efficient way to do A, B, C, F. Or is there?
   - Lucene: Everything looks possible, as it is very low level. But we have
   to write wrappers to do F and build a communication layer between the web
   server and the search server.
   - Solr: Not sure if we can do A, B, C easily. Can we?

 So, my question is what is the best software for the above requirements? I
 am inclined more towards Solr and then Lucene if we get all the
 requirements.


 Regards,
 Rajani Maski








 On Fri, Mar 4, 2011 at 7:16 PM, Shrinath M shrinat...@webyog.com wrote:

 We are building an application which will require us to index data for
 each
 of our users so that we can provide full text search on their data. Here
 are
 some notable things about the application:

 A) The data for every user is totally unrelated to every other user. This
 gives us 

RE: Model foreign key type of search?

2011-03-04 Thread Jonathan Rochkind
Yep, it's tricky to do this sort of thing in Solr. 

One way to do it would be to try to reindex the main item on some regular 
basis with the keywords/comments actually flattened into the main record, maybe 
along with a field for number_of_comments, so you can boost on that or what 
have you.  If you can figure out a way to do that, it would be the easiest/most 
reliable approach, without fighting Solr.  Beware that it's difficult to set up 
a Solr that has very frequent commits though; you might want to batch the updates 
every hour or half hour or what have you. 
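
A sketch of what building that flattened record might look like in SolrJ
(field names like comments_txt and number_of_comments_i are made-up
dynamic-field names, not something from your schema):

import java.util.List;
import org.apache.solr.common.SolrInputDocument;

// Builds one flattened Solr doc per URL, with its bookmark comments/tags
// denormalized in. Field names are illustrative only.
class Flattener {
  static SolrInputDocument flatten(String guid, String title, String text,
                                   List<String> comments, List<String> tags) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("guid", guid);
    doc.addField("title", title);
    doc.addField("text", text);
    for (String c : comments) doc.addField("comments_txt", c); // multi-valued
    for (String t : tags)     doc.addField("tags_txt", t);
    doc.addField("number_of_comments_i", comments.size());     // boostable count
    return doc;
  }
}

Add those in batches and commit on a schedule (or via autoCommit) rather than
per document.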

Another thing to look at is this patch which supports a limited type of 'join' 
in Solr. I'm not sure it's current status of maturity, and I'm not sure if it 
would work in your use case or not. 
https://issues.apache.org/jira/browse/SOLR-2272

And, if your alternative is writing your own thing from scratch, another option 
would be instead writing new components in Java for Solr to try and do what you 
want.  If you can understand the structure and features of the lucene index 
underlying Solr, and figure out a way to get the functionality you want from 
lucene, then that's the first step to figuring out how to write a component for 
Solr to expose it.  

From: alex.d...@gmail.com [alex.d...@gmail.com] On Behalf Of Alex Dong 
[a...@trunk.ly]
Sent: Friday, March 04, 2011 12:56 AM
To: Gora Mohanty
Cc: solr-user@lucene.apache.org
Subject: Re: Model foreign key type of search?

Gora, thanks for the quick reply.

Yes, I'm aware of the differences between Solr and a DBMS. We've actually
written a C++ analytical engine that can process a billion tweets
with multiple-facet drill-down. We may end up cooking our own in the end, but
so far Solr suits our needs quite well.  The multi-lingual tokenizer and
Tika integration are all too addictive.

What you're suggesting is exactly what I'm doing: trying to use dynamic
fields and copyField to get all the information into one field, then run the
search over that.

However, this is not good enough.  Allow me to elaborate using the same
Paris example again.  Say two URLs: the first has 10 people who bookmarked it
and the second has 100, and the two score roughly the same if we squeeze
everything into one single field. Then I'd like to rank the one with more users
higher.

Another way to look at this: PageRank relies on the number and anchor
text of incoming links; we're trying to use the number of people and
their keywords/comments as a weight for the link.

Alex


On Fri, Mar 4, 2011 at 6:29 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Fri, Mar 4, 2011 at 10:24 AM, Alex Dong a...@trunk.ly wrote:
  Hi there,  I need some advice on how to implement this using solr:
 
  We have two tables: urls and bookmarks.
  - Each url has four fields:  {guid, title, text, url}
  - One url will have one or more bookmarks associated with it. Each
 bookmark
  has these: {link.guid, user, tags, comment}
 
  I'd like to return matched urls based on not only the title, text from
 the
  url schema, but also some kind of aggregated popularity score based on
 all
  bookmarks for the same url. The popularity score should base on
  number/frequency of bookmarks that match the query.
 [...]

 It is best not to think of Solr as an RDBMS, and not to try to graft
 RDBMS practices onto it. Instead, you should flatten your data,
 e.g., in the above, you could have:
 * Four single-valued fields: guid, title, text, url
 * Four multi-valued fields: bookmark_guid, bookmark_user,
  bookmark_tags, bookmark_comment
 Your index would contain one record per guid of the URL,
 and you would need to populate the multi-valued bookmark
 fields from all bookmark instances associated with that URL.

 Then one could either copy the relevant search fields to a full-text
 search field, and search only on that, or, e.g., search on bookmark_tags
 and bookmark_comment in addition to searching on title, and text.
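
 In schema.xml that flattened layout would look roughly like this (a sketch;
 the types assume the stock "string"/"text" field types from the example
 schema):

 <field name="guid"             type="string" indexed="true" stored="true" required="true"/>
 <field name="title"            type="text"   indexed="true" stored="true"/>
 <field name="text"             type="text"   indexed="true" stored="true"/>
 <field name="url"              type="string" indexed="true" stored="true"/>
 <field name="bookmark_guid"    type="string" indexed="true" stored="true" multiValued="true"/>
 <field name="bookmark_user"    type="string" indexed="true" stored="true" multiValued="true"/>
 <field name="bookmark_tags"    type="text"   indexed="true" stored="true" multiValued="true"/>
 <field name="bookmark_comment" type="text"   indexed="true" stored="true" multiValued="true"/>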

 Regards,
 Gora



RE: When Index is Updated Frequently

2011-03-04 Thread Jonathan Rochkind
If you can make that solution work for you, I think it is a wise one which will 
serve you well. In some cases that solution won't work, because you _need_ the 
frequently changing data in Solr to be searched against in Solr.  But if you 
can get away without that, I think you will be well-served by keeping any data 
that doesn't need to be searched against by Solr in an external non-Solr store. 
It's really rarely a bad plan to just put in Solr what needs to be searched 
against in Solr -- whether or not the 'other' stuff changes frequently. 

Only you (if anyone!) know enough about your requirements and plans to know how 
much of a problem it will be to have your 'mutable' data not in Solr, and thus 
not searchable with Solr. 

From: Bing Li [lbl...@gmail.com]
Sent: Friday, March 04, 2011 3:21 PM
To: Michael McCandless
Cc: solr-user@lucene.apache.org
Subject: Re: When Index is Updated Frequently

Dear Michael,

Thanks so much for your answer!

I have a question. If Lucene is good at updating, won't frequent updates still
put more load on the Solr cluster? So in my system, I will leave the large
amount of crawled data unchanged forever. Meanwhile, I use a traditional
database to keep the mutable data.

Fortunately, in most Internet systems, the amount of mutable data is much
less than that of the immutable data.

What do you think about my solution?

Best,
LB

On Sat, Mar 5, 2011 at 2:45 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Fri, Mar 4, 2011 at 10:09 AM, Bing Li lbl...@gmail.com wrote:

   According to my experience, when the Lucene index is updated frequently,
  its performance must become low. Is that correct?

 In fact Lucene can gracefully handle a high rate of updates with low
 latency turnaround on the readers, using the near-real-time (NRT) API
 -- IndexWriter.getReader() (or in soon-to-be 31,
 IndexReader.open(IndexWriter)).

  NRT is really something of a hybrid of eventual consistency and
 immediate consistency, because it lets your app have full control
 over how quickly changes must be visible by controlling when you
 pull a new NRT reader.

 That said, Lucene can't offer true immediate consistency at a high
 update rate -- the time to open a new NRT reader is usually too costly
 to do, eg, for every search.  But eg every 100 msec (say) is
 reasonable (depending on many variables...).

 So... for your app you should run some tests and see.  And please
 report back.

 (But, unfortunately, NRT hasn't been exposed in Solr yet...).
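
 At the Lucene level the pattern is roughly this (a sketch against the
 3.1-era API; the writer/analyzer setup is generic boilerplate, not from
 this thread):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_31,
            new StandardAnalyzer(Version.LUCENE_31)));

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);

    // NRT: open a reader straight from the writer, no commit needed.
    // Do this every ~100ms (say), not once per search.
    IndexReader reader = IndexReader.open(writer, true);
    IndexSearcher searcher = new IndexSearcher(reader);
    System.out.println("numDocs = " + reader.numDocs());

    searcher.close();
    reader.close();
    writer.close();
  }
}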

 --
 Mike

 http://blog.mikemccandless.com



Re: uniqueKey merge documents on commit

2011-03-03 Thread Jonathan Rochkind

Nope, there is not.
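
The usual workaround is the one you describe: read the stored copy of doc1,
merge doc2's new fields into it, and re-add the whole document. A rough SolrJ
sketch (this only works if every field you care about is stored; 'server' is
a placeholder for your SolrServer instance):

// Fetch the existing document by its unique key, copy its stored fields into
// a fresh SolrInputDocument, add the new fields, and re-index the whole thing.
SolrQuery q = new SolrQuery("secid:testid");
SolrDocumentList existing = server.query(q).getResults();

SolrInputDocument merged = new SolrInputDocument();
if (!existing.isEmpty()) {
  for (String name : existing.get(0).getFieldNames()) {
    merged.addField(name, existing.get(0).getFieldValue(name));
  }
}
merged.setField("value2_i", 2);   // the "update" from doc2
server.add(merged);               // overwrites the old doc under the same secid
server.commit();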

On 3/3/2011 10:55 AM, Tim Gilbert wrote:

Hi,



I have a unique key within my index, but rather than the default
behaviour of overwriting, I am wondering if there is a method to merge
the two different documents on commit of the second document.  I have a
test case which explains what I'd like to happen:



@Test
public void testMerge() throws SolrServerException, IOException
{
    SolrInputDocument doc1 = new SolrInputDocument();
    doc1.addField("secid", "testid");
    doc1.addField("value1_i", 1);

    SolrAllSec.GetSolrServer().add(doc1);
    SolrAllSec.GetSolrServer().commit();

    SolrInputDocument doc2 = new SolrInputDocument();
    doc2.addField("secid", "testid");
    doc2.addField("value2_i", 2);

    SolrAllSec.GetSolrServer().add(doc2);
    SolrAllSec.GetSolrServer().commit();

    SolrQuery solrQuery = new SolrQuery();
    solrQuery = solrQuery.setQuery("secid:testid");
    QueryResponse response =
        SolrAllSec.GetSolrServer().query(solrQuery, METHOD.GET);

    List<SolrDocument> result = response.getResults();
    Assert.isTrue(result.size() == 1);
    Assert.isTrue(result.contains("value1"));
    Assert.isTrue(result.contains("value2"));
}



Other than reading doc1 and adding the fields from doc2 and
recommitting, is there another way?



Thanks in advance,



Tim






Re: FilterQuery OR statement

2011-03-03 Thread Jonathan Rochkind
You might also consider splitting your two separate AND clauses into 
two separate fq's:


fq=field1:(1 OR 2 OR 3 OR 4)
fq=field2:(4 OR 5 OR 6 OR 7)

That will cache the two separate clauses separately in the filter cache, 
which is probably preferable in general, without knowing more about your 
usage characteristics.


ALSO, instead of either supplying the OR explicitly as above, OR 
changing the default operator in schema.xml for everything, I believe it 
would work to supply it as a local param:


fq={!q.op=OR}field1:(1 2 3 4)

If you want to do that.

AND, to your question: can you search without a 'q'?  No, but you can 
search with a 'q' that selects all documents, to be limited by the fq's.


q=[* TO *]
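
So a full request along those lines might look like (a sketch; *:* is the
match-all-docs query, and the field names are from the example above --
URL-encode the spaces and parens in practice):

http://localhost:8983/solr/select?q=*:*&fq=field1:(1 OR 2 OR 3 OR 4)&fq=field2:(4 OR 5 OR 6 OR 7)&rows=10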

On 3/3/2011 1:14 PM, Tanner Postert wrote:

That worked, thought I tried it before, not sure why it didn't before.

Also, is there a way to query without a q parameter?

I'm just trying to pull back all of the field results where field1:(1 OR 2
OR 3) etc., so I figured I'd use the fq param for caching purposes, because
those queries will likely be run a lot, but if I leave the q parameter off I
get a null pointer error.

On Thu, Mar 3, 2011 at 11:05 AM, Ahmet Arslan iori...@yahoo.com  wrote:


Trying to figure out how I can run
something similar to this for the fq
parameter

Field1 in ( 1, 2, 3 4 )
AND
Field2 in ( 4, 5, 6, 7 )

I found some examples on the net that looked like this:
fq=+field1:(1 2 3
4) +field2(4 5 6 7) but that yields no results.

May be your default operator is set to AND in schema.xml?
If yes, try using +field2(4 OR 5 OR 6 OR 7)






Re: multiple localParams for each query clause

2011-03-02 Thread Jonathan Rochkind
Not per clause, no. But you can use the nested queries feature to set 
local params for each nested query instead.  Which is in fact one of the 
most common use cases for local params.


q=_query_:"{!type=x q.field=z}something" AND 
_query_:"{!type=database}something"


URL encode that whole thing though.

http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/

On 3/2/2011 10:24 AM, Roman Chyla wrote:

Hi,

Is it possible to set local arguments for each query clause?

example:

{!type=x q.field=z}something AND {!type=database}something


I am pulling together result sets coming from two sources, Solr index
and DB engine - however I realized that local parameters apply only to
the whole query - so I don't know how to set the query to mark the
second clause as db-searchable.

Thanks,

   Roman


Re: multi-core solr, specifying the data directory

2011-03-02 Thread Jonathan Rochkind
Meanwhile, I'm having trouble getting the expected behavior at all. I'll 
try to give the right details (without overwhelming with too many), if 
anyone can see what's going on.


Solr 1.4.1. Multi-core. 'Main' solr home with solr.xml at 
/opt/solr/solr_indexer/solr.xml


The solr.xml includes actually only one core, let's start out nice and 
simple:


<cores adminPath="/admin/cores">
<core name="master_prod" instanceDir="master_prod">
<property name="enable.master" value="true" />
</core>
</cores>

[The enable.master thing is a custom property my solrconfig.xml uses in 
places unrelated to dataDir]


1. First try, the  solrconfig at 
/opt/solr/solr_indexer/master_prod/conf/solrconfig.xml includes NO 
dataDir element at all.


WOAH. It just worked. Go figure. I don't know what I tried differently 
before, maybe Mike is right that people (including me) get confused by 
the dataDir element being there, and needing to delete it entirely to 
get that default behavior.


So anyway yeah. Sorry, thanks, appears to be working, although 
possibly confusing for the newbie to set up for reasons that aren't 
entirely clear, since several of us in this thread had trouble getting 
it right.


On 3/2/2011 2:42 PM, Mike Sokolov wrote:

Yes - I commented out the <dataDir> element in solrconfig.xml and then
got the expected behavior: the core used a data subdirectory in the core
subdirectory.

It seems like the problem arises from using the solrconfig.xml that's
distributed as example/solr/conf/solrconfig.xml

The solrconfig.xml's in  example/multicore/ don't have the <dataDir>
element.

-Mike

On 03/01/2011 08:24 PM, Chris Hostetter wrote:

: <!-- Used to specify an alternate directory to hold all index data
:      other than the default ./data under the Solr home.
:      If replication is in use, this should match the replication
:      configuration. -->
: <dataDir>${solr.data.dir:./solr/data}</dataDir>

that directive says use the solr.data.dir system property to pick a path;
if it is not set, use ./solr/data (relative to the CWD)

if you want it to use the default, then you need to eliminate it
completely, or you need to change it to the empty string...

 <dataDir>${solr.data.dir:}</dataDir>

or...

 <dataDir></dataDir>


-Hoss



Re: multi-core solr, specifying the data directory

2011-03-02 Thread Jonathan Rochkind
I wonder if what doesn't work is trying to set an explicit relative path 
there, instead of using the baked-in default 'data'.  If you set an 
explicit relative path, is it relative to the current core's solr home, or 
to the main solr home?


Let's try it to see Yep, THAT's what doesn't work, and probably what 
I was trying to do before.


In solrconfig.xml for a core, I do <dataDir>data</dataDir>.

I expected that would be interpreted relative to the current core's solr 
home, but judging by the log files it is instead based on the 'main' 
solr.home (above the cores, where the solr.xml is) -- or maybe even on 
some other value, the Tomcat base directory or something?


Is _that_ a bug?

On 3/2/2011 3:38 PM, Jonathan Rochkind wrote:

Meanwhile, I'm having trouble getting the expected behavior at all. I'll
try to give the right details (without overwhelming with too many), if
anyone can see what's going on.

Solr 1.4.1. Multi-core. 'Main' solr home with solr.xml at
/opt/solr/solr_indexer/solr.xml

The solr.xml includes actually only one core, let's start out nice and
simple:

<cores adminPath="/admin/cores">
<core name="master_prod" instanceDir="master_prod">
<property name="enable.master" value="true" />
</core>
</cores>

[The enable.master thing is a custom property my solrconfig.xml uses in
places unrelated to dataDir]

1. First try, the  solrconfig at
/opt/solr/solr_indexer/master_prod/conf/solrconfig.xml includes NO
dataDir element at all.

WOAH. It just worked. Go figure. I don't know what I tried differently
before, maybe Mike is right that people (including me) get confused by
the <dataDir> element being there, and needing to delete it entirely to
get that default behavior.

So anyway yeah. Sorry, thanks, appears to be working, although
possibly confusing for the newbie to set up for reasons that aren't
entirely clear, since several of us in this thread had trouble getting
it right.

On 3/2/2011 2:42 PM, Mike Sokolov wrote:

Yes - I commented out the <dataDir> element in solrconfig.xml and then
got the expected behavior: the core used a data subdirectory in the core
subdirectory.

It seems like the problem arises from using the solrconfig.xml that's
distributed as example/solr/conf/solrconfig.xml

The solrconfig.xml's in  example/multicore/ don't have the <dataDir>
element.

-Mike

On 03/01/2011 08:24 PM, Chris Hostetter wrote:

: <!-- Used to specify an alternate directory to hold all index data
:      other than the default ./data under the Solr home.
:      If replication is in use, this should match the replication
:      configuration. -->
: <dataDir>${solr.data.dir:./solr/data}</dataDir>

that directive says use the solr.data.dir system property to pick a path;
if it is not set, use ./solr/data (relative to the CWD)

if you want it to use the default, then you need to eliminate it
completely, or you need to change it to the empty string...

  <dataDir>${solr.data.dir:}</dataDir>

or...

  <dataDir></dataDir>


-Hoss



Re: multi-core solr, specifying the data directory

2011-03-01 Thread Jonathan Rochkind
I did try that, yes. I tried that first in fact!  It seems to fall back 
to a ./data directory relative to the _main_ solr directory (the one 
above all the cores), not the core instancedir.  Which is not what I 
expected either.


I wonder if this should be considered a bug? I wonder if anyone has 
considered this and thought of changing/fixing it?


On 3/1/2011 4:23 AM, Jan Høydahl wrote:

Have you tried removing the <dataDir> tag from solrconfig.xml? Then it should 
fall back to default ./data relative to core instancedir.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 1. mars 2011, at 00.00, Jonathan Rochkind wrote:


Unless I'm doing something wrong, in my experience in multi-core Solr in 1.4.1, 
you NEED to explicitly provide an absolute path to the 'data' dir.

I set up multi-core like this:

<cores adminPath="/admin/cores">
  <core name="some_core" instanceDir="some_core">
  </core>
</cores>


Now, setting instanceDir like that works for Solr to look for the 'conf' 
directory in the default location you'd expect, ./some_core/conf.

You'd expect it to look for the 'data' dir for an index in ./some_core/data 
too, by default.  But it does not seem to. It's still looking for the 'data' 
directory in the _main_ solr.home/data, not under the relevant core directory.

The only way I can manage to get it to look for the /data directory where I 
expect is to spell it out with a full absolute path:

<core name="some_core" instanceDir="some_core">
  <property name="dataDir" value="/path/to/main/solr/some_core/data" />
</core>

And then in the solrconfig.xml do a <dataDir>${dataDir}</dataDir>

Is this what everyone else does too? Or am I missing a better way of doing this?  I would 
have thought it would just work, with Solr by default looking for a ./data 
subdir of the specified instanceDir.  But it definitely doesn't seem to do that.

Should it? Anyone know if Solr in trunk past 1.4.1 has been changed to do what 
I expect? Or am I wrong to expect it? Or does everyone else do multi-core in 
some different way than me where this doesn't come up?

Jonathan





Re: multi-core solr, specifying the data directory

2011-03-01 Thread Jonathan Rochkind
Hmm, okay, have to try to find time to install the example/multicore and 
see.


It's definitely never worked for me, weird.

Thanks.

On 3/1/2011 2:38 PM, Chris Hostetter wrote:

: Unless I'm doing something wrong, in my experience in multi-core Solr in
: 1.4.1, you NEED to explicitly provide an absolute path to the 'data' dir.

have you looked at the example/multicore directory that was included in
the 1.4.1 release?

it has a solr.xml that loads two cores w/o specifying a data dir in the
solr.xml (or the solrconfig.xml) and it uses the data dir inside the
specified instanceDir.

If that example works for you, but your own configs do not, then we'll
need more details about your own configs -- how are you running solr, what
does the solrconfig.xml of the core look like, etc...


-Hoss



Re: solr different sizes on master and slave

2011-03-01 Thread Jonathan Rochkind
The slave should not keep multiple copies _permanently_, but might 
temporarily after it's fetched the new files from master, but before 
it's committed them and fully warmed the new index searchers in the 
slave.  Could that be what's going on, is your slave just still working 
on committing and warming the new version(s) of the index?


[If you do 'commit' to slave (and a replication pull counts as a 
'commit') so quickly that you get overlapping commits before the slave was 
able to warm a new index... it's going to be trouble all around.]


On 3/1/2011 4:27 PM, Mike Franon wrote:

OK, doing some more research I noticed that on the slave there are multiple
index folders, for example

index
index.20110204010900
index.20110204013355
index.20110218125400

and then there is an index.properties that shows which index it is using.

I am just curious why does it keep multiple copies?  Is there a
setting somewhere I can change to only keep one copy so not to lose
space?

Thanks

On Tue, Mar 1, 2011 at 3:26 PM, Mike Franonkongfra...@gmail.com  wrote:

No pending commits, what it looks like is there are almost two copies
of the index on the master, not sure how that happened.



On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma
markus.jel...@openindex.io  wrote:

Are there pending commits on the master?


I was curious why would the size be dramatically different even though
the index versions are the same?

One is 1.2 Gb, and on the slave it is 512 MB

I would think they should both be the same size no?

Thanks


Re: Query on multivalue field

2011-03-01 Thread Jonathan Rochkind
Each token has a position set on it. So if you index the value alpha 
beta gamma, it winds up stored in Solr as (sort of, for the way we want 
to look at it)


document1:
alpha: position 1
beta: position 2
gamma: position 3

 If you set the position increment gap large, then after one value in a 
multi-valued field ends, the position increment gap will be added to the 
positions for the next value. Solr doesn't actually internally have much 
of any idea of a multi-valued field; ALL a multi-valued indexed field 
is, is a position increment gap separating tokens from different 'values'.


So index in a multi-valued field, with position increment gap 10000, 
the values ["alpha beta gamma", "aleph bet"], and you get kind of like:


document1:
alpha: 1
beta: 2
gamma: 3
aleph: 10004
bet: 10005

A large position increment gap, as far as I know and can tell (please 
someone correct me if I'm wrong, I am not a Solr developer) has no 
effect on the size or efficiency of your index on disk.


I am not sure why positionIncrementGap doesn't just default to a very 
large number, to provide behavior that more matches what people expect 
from the idea of a multi-valued field. So maybe there is some flaw in 
my understanding, that justifies some reason for it not to be this way?


But I set my positionIncrementGap very large, and haven't seen any issues.
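For reference, positionIncrementGap is set on the field type in schema.xml. A minimal 
sketch; the type name and analyzer chain here are only illustrative:

  <!-- illustrative type name and analyzer chain -->
  <fieldType name="text_gap" class="solr.TextField" positionIncrementGap="10000">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>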


On 3/1/2011 5:46 PM, Scott Yeadon wrote:

The only trick with this is ensuring the searches return the right
results and don't go across value boundaries. If I set the gap to the
largest text size we expect (approx 5000 chars) what impact does such a
large value have (i.e. does Solr physically separate these fragments in
the index or just apply the figure as part of any query?

Scott.

On 2/03/11 9:01 AM, Ahmet Arslan wrote:

In a multiValued field, call it field1, if I have two values indexed to
this field, say value 1 = some text...termA...more text and value 2 =
some text...termB...more text, and do a search such as
field1:(termA termB)
(where <solrQueryParser defaultOperator="AND"/>) I'm getting a hit
returned even though both terms don't occur within a single value in the
multiValued field.

What I'm wondering is if there is a way of applying the query against
each value of the field rather than against the field in its entirety.
The reason is that the number of values I want to store is variable and
I'd like to avoid the use of dynamic fields or restructuring the index
if possible.

Your best bet can be using positionIncrementGap and issuing a phrase query 
(implicit AND) with the appropriate slop value.

If you have positionIncrementGap=100, you can simulate this with:
q=field1:"termA termB"~100

http://search-lucene.com/m/Hbdvz1og7D71/








Re: multi-core solr, specifying the data directory

2011-03-01 Thread Jonathan Rochkind
This definitely matches my own experience, and I've heard it from 
others. I haven't heard of anyone who HAS gotten it to work like that.  
But apparently there's a multi-core example distributed with Solr which 
is claimed to work the way it doesn't work for us.


One of us has to try the Solr distro multi-core example, as Hoss 
suggested/asked, to see if the problem exhibits even there, and if not, 
figure out what the difference is.  Sorry, haven't found time to figure 
out how to install and start up the demo.


I am running in Tomcat, I wonder if container could matter, and maybe it 
somehow works in Jetty or something?


Jonathan


On 3/1/2011 7:05 PM, Michael Sokolov wrote:

I tried this in my 1.4.0 installation (commenting out what had been
working, hoping the default would be as you said works in the example):

<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="bpro" instanceDir="bpro">
      <!-- <property name="solr.data.dir" value="solr/bpro/data"/> -->
    </core>
    <core name="pfapp" instanceDir="pfapp">
      <property name="solr.data.dir" value="solr/pfapp/data"/>
    </core>
  </cores>
</solr>

In the log after starting up, I get these messages (among many others):

...

Mar 1, 2011 7:51:23 PM org.apache.solr.core.CoreContainer$Initializer
initialize
INFO: looking for solr.xml: /usr/local/tomcat/solr/solr.xml
Mar 1, 2011 7:51:23 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: No /solr/home in JNDI
Mar 1, 2011 7:51:23 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: solr home defaulted to 'solr/' (could not find system property or
JNDI)
Mar 1, 2011 7:51:23 PM org.apache.solr.core.SolrResourceLoader <init>
INFO: Solr home set to 'solr/'

Mar 1, 2011 7:51:23 PM org.apache.solr.core.SolrResourceLoader <init>
INFO: Solr home set to 'solr/bpro/'
...
Mar 1, 2011 7:51:24 PM org.apache.solr.core.SolrCore <init>
INFO: [bpro] Opening new SolrCore at solr/bpro/, dataDir=./solr/data/
...
Mar 1, 2011 7:51:25 PM org.apache.solr.core.SolrResourceLoader <init>
INFO: Solr home set to 'solr/pfapp/'
...
Mar 1, 2011 7:51:26 PM org.apache.solr.core.SolrCore <init>
INFO: [pfapp] Opening new SolrCore at solr/pfapp/, dataDir=solr/pfapp/data/

and it's pretty clearly using the wrong directory at that point.

Some more details:

/usr/local/tomcat has the usual tomcat distribution (this is 6.0.29)
conf/server.xml has:
<Host name="localhost" appBase="webapps"
      unpackWARs="true" autoDeploy="true"
      xmlValidation="false" xmlNamespaceAware="false">

  <Alias>rosen</Alias>
  <Alias>rosen.ifactory.com</Alias>
  <Context path="" docBase="/usr/local/tomcat/webapps/solr" />

</Host>

There is a solrconfig.xml in each of the core directories (should there
only be one of these?).  I believe these are pretty generic (and they
are identical); the one in the bpro folder has:

<!-- Used to specify an alternate directory to hold all index data
     other than the default ./data under the Solr home.
     If replication is in use, this should match the replication
     configuration. -->
<dataDir>${solr.data.dir:./solr/data}</dataDir>



-Mike

On 3/1/2011 4:38 PM, Jonathan Rochkind wrote:

Hmm, okay, have to try to find time to install the example/multicore
and see.

It's definitely never worked for me, weird.

Thanks.

On 3/1/2011 2:38 PM, Chris Hostetter wrote:

: Unless I'm doing something wrong, in my experience in multi-core
Solr in
: 1.4.1, you NEED to explicitly provide an absolute path to the
'data' dir.

have you looked at the example/multicore directory that was included in
the 1.4.1 release?

it has a solr.xml that loads two cores w/o specifying a data dir in the
solr.xml (or the solrconfig.xml) and it uses the data dir inside the
specified instanceDir.

If that example works for you, but your own configs do not, then we'll
need more details about your own configs -- how are you running solr,
what
does the solrconfig.xml of the core look like, etc...


-Hoss





setting different solrconfig.xml for a core

2011-02-28 Thread Jonathan Rochkind
So I think I ought to be able to set up a particular solr core to use a 
different file for solrconfig.xml.


(The reason I want to do this is so I can have master and slave in 
replication have the exact same repo checkout for their conf directory, 
but have the master using a different solrconfig.xml, one set up to be 
master.)


Solr 1.4.1, using this for guidance: http://wiki.apache.org/solr/CoreAdmin

But no matter what I try, while I get no errors in the log file (should 
I be looking for errors somewhere else?), the core doesn't successfully 
come up.


I am trying in the solr.xml, to do this:

<core name="master_prod" instanceDir="master_prod"
      config="master-solrconfig.xml">
  <property name="dataDir" value="/opt/solr/solr_indexer/master_prod/data" />
</core>

Or I try this instead:

<core name="master_prod" instanceDir="master_prod"
      config="master-solrconfig.xml">
  <property name="dataDir" value="/opt/solr/solr_indexer/master_prod/data" />
  <property name="configName" value="master-solrconfig.xml" />
</core>

With either of these, in the log file things look like they started up 
successfully, but it doesn't appear to actually be so; the core is 
actually inaccessible. Maybe there's an error in my 
master-solrconfig.xml, but I don't think so, and there's nothing in the 
log on that either.  Or maybe I'm not doing things right as far as 
telling it to use the 'config file' solrconfig.xml in a different location.


Can anyone confirm for me that this is possible, and what the right way 
to try and do it is?


Re: setting different solrconfig.xml for a core

2011-02-28 Thread Jonathan Rochkind

On 2/28/2011 1:09 PM, Ahmet Arslan wrote:

(The reason I want to do this is so I can have master and
slave in replication have the exact same repo checkout for
their conf directory, but have the master using a different
solrconfig.xml, one set up to be master.)

How about using same solrconfig.xml for master too? As described here:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node



That isn't great, because there are more differences in optimal 
solrconfig.xml between master and slave than just the replication 
handler difference, which that URL covers.


A master (which won't be queried against) doesn't need spellcheck 
running after commits, but the slave does. A master doesn't need slow 
newsearcher/firstsearcher query warm-ups, but the slave does. The master 
may be better with different (lower) cache settings, since it won't be 
used to service live queries.


The documentation clearly suggests it _ought_ to be possible to tell a 
core the name of its config file (default solrconfig.xml) to be 
something other than solrconfig.xml -- but I haven't been able to make 
it work, and I find the lack of any errors in the log file when it's not 
working to be frustrating.


Has anyone actually done this? Can anyone confirm that it's even 
possible, and the documentation isn't just taking me for a ride?




Re: setting different solrconfig.xml for a core

2011-02-28 Thread Jonathan Rochkind
Okay, I did manage to find a clue from the log that it's not working, 
when it's not working:


INFO: Jk running ID=0 time=0/66  config=null

config=null, that's not right.  When I try to over-ride the config file 
name in solr.xml core config, I can't seem to put a name in there that 
works to find a file that does actually exist.  Unless I put the name 
solrconfig.xml in there, then it works fine, heh.




On 2/28/2011 3:00 PM, Jonathan Rochkind wrote:

On 2/28/2011 1:09 PM, Ahmet Arslan wrote:

(The reason I want to do this is so I can have master and
slave in replication have the exact same repo checkout for
their conf directory, but have the master using a different
solrconfig.xml, one set up to be master.)

How about using same solrconfig.xml for master too? As described here:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node


That isn't great, because there are more differences in optimal
solrconfig.xml between master and slave than just the replication
handler difference, which that URL covers.

A master (which won't be queried against) doesn't need spellcheck
running after commits, but the slave does. A master doesn't need slow
newsearcher/firstsearcher query warm-ups, but the slave does. The master
may be better with different (lower) cache settings, since it won't be
used to service live queries.

The documentation clearly suggests it _ought_ to be possible to tell a
core the name of its config file (default solrconfig.xml) to be
something other than solrconfig.xml -- but I haven't been able to make
it work, and I find the lack of any errors in the log file when it's not
working to be frustrating.

Has anyone actually done this? Can anyone confirm that it's even
possible, and the documentation isn't just taking me for a ride?




Re: setting different solrconfig.xml for a core

2011-02-28 Thread Jonathan Rochkind
Yeah, I'm actually _not_ trying to get replication to copy over the 
config files.  Instead, I'm assuming the config files are all there, and 
I'm actually trying to get one of the cores to _use_ a file that, on 
disk in that core, is actually called, e.g., solrconfig_slave.xml.


This wiki page: http://wiki.apache.org/solr/CoreAdmin

suggests I _ought_ to be able to do that, to tell a particular core to 
use a config file of any name I want. But I'm having trouble getting it 
to work. But that could be my own local mistake of some kind too. Just 
makes it harder to figure out when I'm not even exactly sure how you're 
_supposed_ to be able to do that -- CoreAdmin wiki page implies at least 
two different ways you should be able to do it, but doesn't include an 
actual example so I'm not sure if I'm understanding what it's implying 
correctly -- or if the actual 1.4.1 behavior matches what's in that wiki 
page anyway.


On 2/28/2011 3:14 PM, Dyer, James wrote:

Jonathan,

When I was first setting up replication a couple weeks ago, I had this working, 
as described here: 
http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml

I created the slave's solrconfig.xml and saved it on the master in the conf dir as 
solrconfig_slave.xml, then began the confFiles parameter on the master with 
solrconfig_slave.xml:solrconfig.xml,schema.xml,etc.  And it was working (v1.4.1).  I'm not sure why you haven't had 
good luck with this but you can at least know it is possible to get it to work.

I think to get the slave up and running for the first time I saved the slave's version on the slave as 
solrconfig.xml.  It then would copy over any changed versions of solrconfig_slave.xml 
from the master to the slave, saving them on the slave as solrconfig.xml.  But I primed it by 
giving it its config file in-sync to start with.

I ended up going the same-config-file-everywhere route though because we're 
using our master to handle requests when it's not indexing (one less server to 
buy)...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Monday, February 28, 2011 2:03 PM
To: solr-user@lucene.apache.org
Subject: Re: setting different solrconfig.xml for a core

Okay, I did manage to find a clue from the log that it's not working,
when it's not working:

INFO: Jk running ID=0 time=0/66  config=null

config=null, that's not right.  When I try to over-ride the config file
name in solr.xml core config, I can't seem to put a name in there that
works to find a file that does actually exist.  Unless I put the name
solrconfig.xml in there, then it works fine, heh.



On 2/28/2011 3:00 PM, Jonathan Rochkind wrote:

On 2/28/2011 1:09 PM, Ahmet Arslan wrote:

(The reason I want to do this is so I can have master and
slave in replication have the exact same repo checkout for
their conf directory, but have the master using a different
solrconfig.xml, one set up to be master.)

How about using same solrconfig.xml for master too? As described here:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node


That isn't great, because there are more differences in optimal
solrconfig.xml between master and slave than just the replication
handler difference, which that URL covers.

A master (which won't be queried against) doesn't need spellcheck
running after commits, but the slave does. A master doesn't need slow
newsearcher/firstsearcher query warm-ups, but the slave does. The master
may be better with different (lower) cache settings, since it won't be
used to service live queries.

The documentation clearly suggests it _ought_ to be possible to tell a
core the name of its config file (default solrconfig.xml) to be
something other than solrconfig.xml -- but I haven't been able to make
it work, and I find the lack of any errors in the log file when it's not
working to be frustrating.

Has anyone actually done this? Can anyone confirm that it's even
possible, and the documentation isn't just taking me for a ride?




Re: setting different solrconfig.xml for a core

2011-02-28 Thread Jonathan Rochkind
Aha, wait, I think I've made it work, as simple as this in the solr.xml 
core config, to make a core use a solrconfig.xml file with a different name:


... <core name="master_prod" instanceDir="master_prod" 
config="master-solrconfig.xml"> ...


Not sure why that didn't work the first half a dozen times I tried. May 
have had a syntax error in my master-solrconfig.xml file, even though 
the Solr log files didn't report any, maybe when there's a syntax error 
Solr just silently gives up on the config file and presents an empty 
index, I dunno.
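Putting the pieces together, the solr.xml shape that ended up working looks roughly 
like this (the dataDir property is the one discussed earlier; paths are illustrative):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="master_prod" instanceDir="master_prod"
            config="master-solrconfig.xml">
        <!-- dataDir path as in the earlier messages; adjust to your layout -->
        <property name="dataDir" value="/opt/solr/solr_indexer/master_prod/data" />
      </core>
    </cores>
  </solr>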


On 2/28/2011 3:46 PM, Jonathan Rochkind wrote:

Yeah, I'm actually _not_ trying to get replication to copy over the
config files.  Instead, I'm assuming the config files are all there, and
I'm actually trying to get one of the cores to _use_ a file that
actually on disk in that core is called, eg, solrconfig_slave.xml.

This wiki page: http://wiki.apache.org/solr/CoreAdmin

suggests I _ought_ to be able to do that, to tell a particular core to
use a config file of any name I want. But I'm having trouble getting it
to work. But that could be my own local mistake of some kind too. Just
makes it harder to figure out when I'm not even exactly sure how you're
_supposed_ to be able to do that -- CoreAdmin wiki page implies at least
two different ways you should be able to do it, but doesn't include an
actual example so I'm not sure if I'm understanding what it's implying
correctly -- or if the actual 1.4.1 behavior matches what's in that wiki
page anyway.

On 2/28/2011 3:14 PM, Dyer, James wrote:

Jonathan,

When I was first setting up replication a couple weeks ago, I had this working, 
as described here: 
http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml

I created the slave's solrconfig.xml and saved it on the master in the conf dir as 
solrconfig_slave.xml, then began the confFiles parameter on the master with 
solrconfig_slave.xml:solrconfig.xml,schema.xml,etc.  And it was working (v1.4.1).  I'm not sure why you haven't had 
good luck with this but you can at least know it is possible to get it to work.

I think to get the slave up and running for the first time I saved the slave's version on the slave as 
solrconfig.xml.  It then would copy over any changed versions of solrconfig_slave.xml 
from the master to the slave, saving them on the slave as solrconfig.xml.  But I primed it by 
giving it its config file in-sync to start with.

I ended up going the same-config-file-everywhere route though because we're 
using our master to handle requests when its not indexing (one less server to 
buy)...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Monday, February 28, 2011 2:03 PM
To: solr-user@lucene.apache.org
Subject: Re: setting different solrconfig.xml for a core

Okay, I did manage to find a clue from the log that it's not working,
when it's not working:

INFO: Jk running ID=0 time=0/66  config=null

config=null, that's not right.  When I try to over-ride the config file
name in solr.xml core config, I can't seem to put a name in there that
works to find a file that does actually exist.  Unless I put the name
solrconfig.xml in there, then it works fine, heh.



On 2/28/2011 3:00 PM, Jonathan Rochkind wrote:

On 2/28/2011 1:09 PM, Ahmet Arslan wrote:

(The reason I want to do this is so I can have master and
slave in replication have the exact same repo checkout for
their conf directory, but have the master using a different
solrconfig.xml, one set up to be master.)

How about using same solrconfig.xml for master too? As described here:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node


That isn't great, because there are more differences in optimal
solrconfig.xml between master and slave than just the replication
handler difference, which that URL covers.

A master (which won't be queried against) doesn't need spellcheck
running after commits, but the slave does. A master doesn't need slow
newsearcher/firstsearcher query warm-ups, but the slave does. The master
may be better with different (lower) cache settings, since it won't be
used to service live queries.

The documentation clearly suggests it _ought_ to be possible to tell a
core the name of its config file (default solrconfig.xml) to be
something other than solrconfig.xml -- but I haven't been able to make
it work, and I find the lack of any errors in the log file when it's not
working to be frustrating.

Has anyone actually done this? Can anyone confirm that it's even
possible, and the documentation isn't just taking me for a ride?




Re: setting different solrconfig.xml for a core

2011-02-28 Thread Jonathan Rochkind
And in other news of other possibilities. If I DID want to use the same 
solrconfig.xml for both master and slave, but disable the 
newsearcher/firstsearcher queries on master, it _looks_ like I can use 
the techique here:


http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node

Applied to newsearcher/firstsearcher too:

<listener event="firstSearcher" class="solr.QuerySenderListener"
          enable="${enable.slave:false}">


Now that listener will only be turned on if enable.slave is set to true. 
Might make more sense to use a different property value there, like 
enable.searcher or something.


I'm not entirely sure in what places the enable attribute is 
recognized and in what places it isn't, but it LOOKS like it's 
recognized on the listener tag.  I think.
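A sketch of what that would look like for both warming listeners in solrconfig.xml; 
the property name follows the suggestion above, and whether enable is actually 
honored on listener is exactly the open question here:

  <!-- warming query is a placeholder -->
  <listener event="firstSearcher" class="solr.QuerySenderListener"
            enable="${enable.searcher:false}">
    <arr name="queries">
      <lst><str name="q">some warming query</str></lst>
    </arr>
  </listener>
  <listener event="newSearcher" class="solr.QuerySenderListener"
            enable="${enable.searcher:false}">
    <arr name="queries">
      <lst><str name="q">some warming query</str></lst>
    </arr>
  </listener>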



On 2/28/2011 3:52 PM, Jonathan Rochkind wrote:

Aha, wait, I think I've made it work, as simple as this in the solr.xml
core config, to make a core use a solrconfig.xml file with a different name:

...<core name="master_prod" instanceDir="master_prod"
config="master-solrconfig.xml"> ...

Not sure why that didn't work the first half a dozen times I tried. May
have had a syntax error in my master-solrconfig.xml file, even though
the Solr log files didn't report any, maybe when there's a syntax error
Solr just silently gives up on the config file and presents an empty
index, I dunno.

On 2/28/2011 3:46 PM, Jonathan Rochkind wrote:

Yeah, I'm actually _not_ trying to get replication to copy over the
config files.  Instead, I'm assuming the config files are all there, and
I'm actually trying to get one of the cores to _use_ a file that
actually on disk in that core is called, eg, solrconfig_slave.xml.

This wiki page: http://wiki.apache.org/solr/CoreAdmin

suggests I _ought_ to be able to do that, to tell a particular core to
use a config file of any name I want. But I'm having trouble getting it
to work. But that could be my own local mistake of some kind too. Just
makes it harder to figure out when I'm not even exactly sure how you're
_supposed_ to be able to do that -- CoreAdmin wiki page implies at least
two different ways you should be able to do it, but doesn't include an
actual example so I'm not sure if I'm understanding what it's implying
correctly -- or if the actual 1.4.1 behavior matches what's in that wiki
page anyway.

On 2/28/2011 3:14 PM, Dyer, James wrote:

Jonathan,

When I was first setting up replication a couple weeks ago, I had this working, 
as described here: 
http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml

I created the slave's solrconfig.xml and saved it on the master in the conf dir as 
solrconfig_slave.xml, then began the confFiles parameter on the master with 
solrconfig_slave.xml:solrconfig.xml,schema.xml,etc.  And it was working (v1.4.1).  I'm not sure why you haven't had 
good luck with this but you can at least know it is possible to get it to work.

I think to get the slave up and running for the first time I saved the slave's version on the slave as 
solrconfig.xml.  It then would copy over any changed versions of solrconfig_slave.xml 
from the master to the slave, saving them on the slave as solrconfig.xml.  But I primed it by 
giving it its config file in-sync to start with.

I ended up going the same-config-file-everywhere route though because we're 
using our master to handle requests when its not indexing (one less server to 
buy)...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Monday, February 28, 2011 2:03 PM
To: solr-user@lucene.apache.org
Subject: Re: setting different solrconfig.xml for a core

Okay, I did manage to find a clue from the log that it's not working,
when it's not working:

INFO: Jk running ID=0 time=0/66  config=null

config=null, that's not right.  When I try to over-ride the config file
name in solr.xml core config, I can't seem to put a name in there that
works to find a file that does actually exist.  Unless I put the name
solrconfig.xml in there, then it works fine, heh.



On 2/28/2011 3:00 PM, Jonathan Rochkind wrote:

On 2/28/2011 1:09 PM, Ahmet Arslan wrote:

(The reason I want to do this is so I can have master and
slave in replication have the exact same repo checkout for
their conf directory, but have the master using a different
solrconfig.xml, one set up to be master.)

How about using same solrconfig.xml for master too? As described here:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node


That isn't great, because there are more differences in optimal
solrconfig.xml between master and slave than just the replication
handler difference, which that URL covers.

A master (which won't be queried against) doesn't need spellcheck
running after commits, but the slave does. A master doesn't need slow
newsearcher/firstsearcher query warm-ups, but the slave does. The master
may be better

Re: setting different solrconfig.xml for a core

2011-02-28 Thread Jonathan Rochkind
Hmm, I'm pretty sure I'm seeing that listener can take an 'enable' 
attribute too.  Even though that's not a searchComponent or a 
requestHandler, is it?


After toggling enable back and forth on a listener and restarting Solr 
and watching my logs closely, I am as confident as I can be that it 
mysteriously is being respected on listener. Go figure.


Convenient for me, because I wanted to disable my expensive and 
timeconsuming newSearcher/firstSearcher warming queries on a core marked 
'master'.


On 2/28/2011 4:21 PM, Dyer, James wrote:

Just did a quick search for ' enable= ' in the 1.4.1 source.  Looks like from the example solrconfig.xml, 
both <searchComponent> and <requestHandler> tags can take the enable attribute.  It's only shown with 
the ClusteringComponent so I'm not sure if just any SC or RH will honor it.  Also see the unit test 
TestPluginEnable.java, which seems to show that the StandardRequestHandler will honor it.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Monday, February 28, 2011 3:09 PM
To: solr-user@lucene.apache.org
Subject: Re: setting different solrconfig.xml for a core

And in other news of other possibilities. If I DID want to use the same
solrconfig.xml for both master and slave, but disable the
newsearcher/firstsearcher queries on master, it _looks_ like I can use
the techique here:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node

Applied to newsearcher/firstsearcher too:

<listener event="firstSearcher" class="solr.QuerySenderListener"
          enable="${enable.slave:false}">

Now that listener will only be turned on if enable.slave is set to true.
Might make more sense to use a different property value there, like
enable.searcher or something.

I'm not entirely sure in what places the enable attribute is
recognized and in what places it isn't, but it LOOKS like it's
recognized on the listener tag.  I think.


On 2/28/2011 3:52 PM, Jonathan Rochkind wrote:

Aha, wait, I think I've made it work, as simple as this in the solr.xml
core config, to make a core use a solrconfig.xml file with a different name:

...<core name="master_prod" instanceDir="master_prod"
config="master-solrconfig.xml"> ...

Not sure why that didn't work the first half a dozen times I tried. May
have had a syntax error in my master-solrconfig.xml file, even though
the Solr log files didn't report any, maybe when there's a syntax error
Solr just silently gives up on the config file and presents an empty
index, I dunno.

On 2/28/2011 3:46 PM, Jonathan Rochkind wrote:

Yeah, I'm actually _not_ trying to get replication to copy over the
config files.  Instead, I'm assuming the config files are all there, and
I'm actually trying to get one of the cores to _use_ a file that
actually on disk in that core is called, eg, solrconfig_slave.xml.

This wiki page: http://wiki.apache.org/solr/CoreAdmin

suggests I _ought_ to be able to do that, to tell a particular core to
use a config file of any name I want. But I'm having trouble getting it
to work. But that could be my own local mistake of some kind too. Just
makes it harder to figure out when I'm not even exactly sure how you're
_supposed_ to be able to do that -- CoreAdmin wiki page implies at least
two different ways you should be able to do it, but doesn't include an
actual example so I'm not sure if I'm understanding what it's implying
correctly -- or if the actual 1.4.1 behavior matches what's in that wiki
page anyway.

On 2/28/2011 3:14 PM, Dyer, James wrote:

Jonathan,

When I was first setting up replication a couple weeks ago, I had this working, 
as described here: 
http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml

I created the slave's solrconfig.xml and saved it on the master in the conf dir as 
solrconfig_slave.xml, then began the confFiles parameter on the master with 
solrconfig_slave.xml:solrconfig.xml,schema.xml,etc.  And it was working (v1.4.1).  I'm not sure why you haven't had 
good luck with this but you can at least know it is possible to get it to work.

I think to get the slave up and running for the first time I saved the slave's version on the slave as 
solrconfig.xml.  It then would copy over any changed versions of solrconfig_slave.xml 
from the master to the slave, saving them on the slave as solrconfig.xml.  But I primed it by 
giving it its config file in-sync to start with.

I ended up going the same-config-file-everywhere route though because we're 
using our master to handle requests when its not indexing (one less server to 
buy)...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Monday, February 28, 2011 2:03 PM
To: solr-user@lucene.apache.org
Subject: Re: setting different solrconfig.xml for a core

Okay, I did manage to find a clue from the log that it's not working,
when it's

suggestion: do not require masterUrl for slave config

2011-02-28 Thread Jonathan Rochkind
Suggestion, curious what other people think of it, if I should bother 
filing a JIRA and/or trying to come up with a patch.


Currently, when you configure a replication <lst name="slave">, you HAVE 
to give it a masterUrl.


SEVERE: org.apache.solr.common.SolrException: 'masterUrl' is required 
for a slave

at org.apache.solr.handler.SnapPuller.<init>(SnapPuller.java:126)
at 
org.apache.solr.handler.ReplicationHandler.inform(ReplicationHandler.java:775)
at 
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:508)

at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)


At first this makes sense, why would you want a slave without a 
masterUrl?  But since you can supply the masterUrl as a query parameter 
in /replication?command=fetchIndex&masterUrl=X, there's really no reason 
to require you to specify it in the solrconfig.xml, if you are planning 
on not having automatic polling, but just triggering replication 
manually, and supplying the masterUrl in the command every time.   This 
can sometimes be convenient for letting some other monitor process 
decide when and how to replicate, instead of having solr itself be 
configured for pulling via polling.


Does that make any sense?
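Concretely, the kind of manually triggered pull being described would look something 
like this; host names and ports are placeholders, and the command spelling follows the 
messages above:

  http://slave-host:8983/solr/replication?command=fetchIndex&masterUrl=http://master-host:8983/solr/replication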




multi-core solr, specifying the data directory

2011-02-28 Thread Jonathan Rochkind
Unless I'm doing something wrong, in my experience in multi-core Solr in 
1.4.1, you NEED to explicitly provide an absolute path to the 'data' dir.


I set up multi-core like this:

<cores adminPath="/admin/cores">
  <core name="some_core" instanceDir="some_core">
  </core>
</cores>


Now, setting instanceDir like that works for Solr to look for the 'conf' 
directory in the default location you'd expect, ./some_core/conf.


You'd expect it to look for the 'data' dir for an index in 
./some_core/data too, by default.  But it does not seem to. It's still 
looking for the 'data' directory in the _main_ solr.home/data, not under 
the relevant core directory.


The only way I can manage to get it to look for the /data directory 
where I expect is to spell it out with a full absolute path:


<core name="some_core" instanceDir="some_core">
  <property name="dataDir" value="/path/to/main/solr/some_core/data" />
</core>

And then in the solrconfig.xml do a <dataDir>${dataDir}</dataDir>

Is this what everyone else does too? Or am I missing a better way of 
doing this?  I would have thought it would just work, with Solr by 
default looking for a ./data subdir of the specified instanceDir.  But 
it definitely doesn't seem to do that.


Should it? Anyone know if Solr in trunk past 1.4.1 has been changed to 
do what I expect? Or am I wrong to expect it? Or does everyone else do 
multi-core in some different way than me where this doesn't come up?


Jonathan



RE: Disabling caching for fq param?

2011-02-28 Thread Jonathan Rochkind
As far as I know there is not. It might be beneficial, but it's also worth 
considering: thousands of users isn't _that_ many, and if that same clause is 
always the same per user, then if the same user does a query a second time, it 
wouldn't hurt to have their user-specific fq in the cache.  A single fq cache 
entry may not take as much RAM as you think; you could potentially afford to 
increase your fq cache size to thousands/tens-of-thousands, and win all the way around. 

The filter cache should be a least-recently-used-out-first cache, so even if 
the filter cache isn't big enough for all of them, fq's that are used by more 
than one user will probably stay in the cache as old user-specific fq's end up 
falling off the back as least-recently-used. 

So in actual practice, one way or another, it may not be a problem. 
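If you do go the route of just making the cache big enough, the knob is the filterCache 
in solrconfig.xml; a sketch with made-up sizes:

  <!-- sizes are made up; tune to your own hit rates and heap -->
  <filterCache
    class="solr.LRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="1024"/>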

From: mrw [mikerobertsw...@gmail.com]
Sent: Monday, February 28, 2011 9:06 PM
To: solr-user@lucene.apache.org
Subject: Disabling caching for fq param?

Based on what I've read here and what I could find on the web, it seems that
each fq clause essentially gets its own results cache.  Is that correct?

We have a corporate policy of passing the user's Oracle OLS labels into the
index in order to be matched against the labels field.  I currently separate
this from the user's query text by sticking it into an fq param...

?q=user-entered expression
fq=labels:the label values expression
qf=song metadata copy field song lyrics field
tie=0.1
defType=dismax

...but since its value (a collection of hundreds of label values) only applies
to that user, the accompanying result set won't be reusable by other users:

My understanding is that this query will result in two result sets (q and
fq) being cached separately, with the union of the two sets being returned
to the user.  (Is that correct?)

There are thousands of users, each with a unique combination of labels, so
there seems to be little value in caching the result set created from the fq
labels param.  It would be beneficial if there were some kind of fq
parameter override to indicate to Solr to not cache the results?


Thanks!






RE: query results filter

2011-02-24 Thread Jonathan Rochkind
Hmm, depending on what you are actually needing to do, can you do it with a 
simple fq param to filter out what you want filtered out, instead of needing to 
write custom Java as you are suggesting? It would be a lot easier to just use 
an fq. 

How would you describe the documents you want to filter from the query results 
page?  Can that description be represented by a Solr query you can already 
represent using the lucene, dismax, or any other existing query? If so, why not 
just use a negated fq describing what to omit from the results?
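For example, something along these lines (field name and value are made up):

  fq=-doctype:suppressed

i.e. a negated filter query that excludes the documents you never want on the results 
page, cached and reused like any other fq.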

From: Babak Farhang [farh...@gmail.com]
Sent: Thursday, February 24, 2011 6:58 PM
To: solr-user
Subject: query results filter

Hi everyone,

I have some existing solr cores that for one reason or another have
documents that I need to filter from the query results page.

I would like to do this inside Solr instead of doing it on the
receiving end, in the client.  After searching the mailing list
archives and Solr wiki, it appears you do this by registering a custom
SearchHandler / SearchComponent with Solr.  Still, I don't quite
understand how this machinery fits together.  Any suggestions / ideas
/ pointers much appreciated!

Cheers,
-Babak

~~

Ideally, I'd like to find / code a solution that does the following:

1. A request handler that works like the StandardRequestHandler but
which allows an optional DocFilter (say, modeled like the
java.io.FileFilter interface)
2. Allows current pagination to work transparently.
3. Works transparently with distributed/sharded queries.


RE: Best way for a query-expander?

2011-02-19 Thread Jonathan Rochkind
I don't think there's any way to do this in Solr, although you could write your 
own query parser in Java if you wanted to. 

You can set defaults, invariants, and appends values on your request 
handler, but I don't think that's flexible enough to do what you want. 
http://wiki.apache.org/solr/SearchHandler

In general, from my perspective, Solr seems to be written assuming a trusted 
client.  If you are allowing access to untrusted clients, there are probably 
all sorts of things a client can do that you wouldn't want them to, so writing 
your own query parser might be a good idea. 

From: Paul Libbrecht [p...@hoplahup.net]
Sent: Saturday, February 19, 2011 11:01 AM
To: solr-user@lucene.apache.org
Subject: Re: Best way for a query-expander?

Hello list,

as Hoss suggests, I'll try to be more detailed.

I wish to use http parameters in my requests that define the precise semantics 
of an advanced search.
For example, if I can see from the session which user is requesting, then not 
only public resources but also resources private to him should be returned.
For example, if there's a parameter ict, I want to expand the query with an 
extra (mandatory) term-query.

I know I could probably do this at the client level but I do not think this is 
the best way, in particular regarding the access to private resources... I also 
think it's better not to rely too heavily on the client's ability to formulate 
string-queries, since that allows all sorts of tweaking that one may not wish 
to be possible, in particular for queries that are service oriented.

paul


On 19 Feb 2011, at 01:18, Chris Hostetter wrote:


 : I want to implement a query-expander, one that enriches the input by the
 : usage of extra parameters that, for example, a form may provide.
 :
 : Is the right way to subclass SearchHandler?
 : Or rather to subclass QueryComponent?

 This smells like the poster child for an X/Y problem
 (or maybe an X/(Y OR Z) problem)...

 if you can elaborate a bit more on the type of enrichment you want to do,
 it's highly likely that your goal can be met w/o needing to write a custom
 plugin (i'm thinking particularly of the multitudes of parsers solr
 already has, local params, and variable substitution)

 http://people.apache.org/~hossman/#xyproblem
 XY Problem

 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341


 -Hoss



Re: GET or POST for large queries?

2011-02-17 Thread Jonathan Rochkind
Yes, I think it's 1024 by default.  I think you can raise it in your 
config. But your performance may suffer.


Best would be to try and find a better way to do what you want without 
using thousands of clauses. This might require some custom Java plugins 
to Solr though.
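For reference, the limit being hit is maxBooleanClauses, which I believe lives in the 
query section of solrconfig.xml; raising it is a one-line change (the value here is 
only an example):

  <maxBooleanClauses>4096</maxBooleanClauses>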


On 2/17/2011 3:52 PM, mrw wrote:

Yeah, I tried switching to POST.

It seems to be handling the size, but apparently Solr has a limit on the
number of boolean comparisons -- I'm now getting too many boolean clauses
errors emanating from

org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108).
:)


Thanks for responding.



Erik Hatcher-4 wrote:

Yes, you may use POST to make search requests to Solr.

Erik




optimize and mergeFactor

2011-02-16 Thread Jonathan Rochkind
In my own Solr 1.4, I am pretty sure that running an index optimize does 
give me significant better performance. Perhaps because I use some 
largeish (not huge, maybe as large as 200k) stored fields.


So I'm interested in always keeping my index optimized.

Am I right that if I set mergeFactor to '1', essentially my index will 
always be optimized after every commit, and actually running 'optimize' 
will be redundant?


What are the possible negative repurcussions of setting mergeFactor to 
1? Is this a really bad idea?  If not 1, what about some other 
lower-than-usually-recommended value like 2 or 3?  Anyone done this?
I imagine it will slow down my commits, but if the alternative is 
running optimize a lot anyway I wonder at what point I get 'break 
even' (if I optimize after every single commit, clearly might as well 
just set the mergeFactor low, right? But if I optimize after every X 
documents or Y commits don't know what X/Y are break-even).


Jonathan
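For reference, the setting under discussion is the mergeFactor in solrconfig.xml, set 
under indexDefaults and overridable under mainIndex in 1.4; the value here is just the 
one being debated:

  <mergeFactor>2</mergeFactor>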


Re: optimize and mergeFactor

2011-02-16 Thread Jonathan Rochkind

Thanks for the answers, more questions below.

On 2/16/2011 3:37 PM, Markus Jelsma wrote:



200,000 stored fields? I assume that number includes your number of documents?
Sounds crazy =)


Nope, I wasn't clear. I have less than a dozen stored fields, but the 
value of a stored field can sometimes be as large as 200kb.




You can set mergeFactor to 2, not lower.


Am I right though that manually running an 'optimize' is the equivalent 
of a mergeFactor=1?  So there's no way to get Solr to keep the index in 
an 'always optimized' state, if I'm understanding correctly? Cool. Just 
want to understand what's going on.



This depends on commit rate and if there are a lot of updates and deletes
instead of adds. Setting it very low will indeed cause a lot of merging and
slow commits. It will also be very slow in replication because merged files are
copied over again and again, causing high I/O on your slaves.

There is always a `break even` but it depends (as usual) on your scenario and
business demands.



There are indeed sadly lots of updates and deletes, which is why I need 
to run optimize periodically. I am aware that this will cause more work 
for replication -- I think this is true whether I manually issue an 
optimize before replication _or_ whether I just keep the mergeFactor 
very low, right? Same issue either way.


So... if I'm going to do lots of updates and deletes, and my other 
option is running an optimize before replication anyway, is there 
any reason it's going to be completely stupid to set the mergeFactor to 
2 on the master?  I realize it'll mean all index files are going to have 
to be replicated, but that would be the case if I ran a manual optimize 
in the same situation before replication too, I think.


Jonathan


Re: Solr multi cores or not

2011-02-16 Thread Jonathan Rochkind
Solr multi-core essentially just lets you run multiple separate, distinct 
Solr indexes in the same running Solr instance.


It does NOT let you run queries across multiple cores at once. The 
cores are just like completely separate Solr indexes, they are just 
conveniently running in the same Solr instance. (Which can be easier and 
more compact to set up than actually setting up separate Solr instances. 
And they can share some config more easily. And it _may_ have 
implications on JVM usage, not sure.)


There is no good way in Solr to run a query across multiple Solr 
indexes; whether they are multi-core or single cores in separate Solr 
instances doesn't matter.


Your first approach should be to try and put all the data in one Solr 
index. (one Solr 'core').


Jonathan

On 2/16/2011 3:45 PM, Thumuluri, Sai wrote:

Hi,

I have a need to index multiple applications using Solr, I also have the
need to share indexes or run a search query across these application
indexes. Is solr multi-core - the way to go?  My server config is
2virtual CPUs @ 1.8 GHz and has about 32GB of memory. What is the
recommendation?

Thanks,
Sai Thumuluri





minimum Solr slave replication config

2011-02-16 Thread Jonathan Rochkind
Solr 1.4.1. So, from the documentation at 
http://wiki.apache.org/solr/SolrReplication


I was wondering if I could get away without having any actual 
configuration in my slave at all. The replication handler is turned on, 
but if I'm going to manually trigger replication pulls while supplying 
the master URL manually with the command too, by:


command=fetchIndex&masterUrl=$solr_master

Then I was thinking, gee, maybe I don't need any slave config at all. 
That _appears_ to not be true. In such a situation, when I tell the 
slave to fetchIndex&masterUrl=$solr_master, the command gives a 200 OK.


But then I go and check /replication?command=details on the slave, I'm 
actually presented with an exception:


message null java.lang.NullPointerException at 
org.apache.solr.handler.ReplicationHandler.isPollingDisabled(ReplicationHandler.java:412) 
at


So I'm thinking this is probably becuase you actually can't get away 
with no slave config at all.


So:

1) Is this a bug? Maybe I did something I shoudn't have, but having 
command=details report a NullPointerException is probably not good, 
right?  If someone who knows better agrees, I'll file it in JIRA?


2) Does anyone know what the minimal slave config is?  If I plan to 
manually trigger replication pulls, and supply the masterUrl, maybe 
just an empty <lst name="slave"></lst>.  Or are there other parameters I 
have to set even though I don't plan to use them? (I do not want 
automatic polling, only manually triggered pulls).  Anyone have any 
advice, or should I just trial and error?
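For comparison, the slave config I'd normally expect in solrconfig.xml, per the 
SolrReplication wiki; the host name is a placeholder, and the question above is how 
much of this can be left out:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <!-- master host is a placeholder -->
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>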





Re: Solr multi cores or not

2011-02-16 Thread Jonathan Rochkind
Yes, you're right; from now on when I say that, I'll say "except 
shards". It is true.


My understanding is that shards functionality's intended use case is for 
when your index is so large that you want to split it up for 
performance. I think it works pretty well for that, with some 
limitations as you mention.


From reading the list, my impression is that when people try to use 
shards to solve some _other_ problem, they generally run into problems. 
But maybe that's just because the people with the problems are the ones 
who appear on the list?


My personal advice is still to try and put everything together in one 
big index; Solr will give you the least trouble with that, it's what 
Solr likes to do best.  Move to shards certainly if your index is so 
large that moving to shards will give you the performance advantage you 
need, that's what they're for; be very cautious moving to shards for 
other challenges that 'one big index' is giving you that you're thinking 
shards will solve. Shards is, as I understand it, _not_ intended as a 
general purpose federation function; it's specifically intended to 
split an index across multiple hosts for performance.


Jonathan
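For anyone following along, the shards mechanism discussed below is just a request 
parameter listing the cores or hosts to fan the query out to; hosts and core names 
here are placeholders:

  http://host1:8983/solr/core1/select?q=something&shards=host1:8983/solr/core1,host2:8983/solr/core2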

On 2/16/2011 4:37 PM, Bob Sandiford wrote:

Hmmm.  Maybe I'm not understanding what you're getting at, Jonathan, when you 
say 'There is no good way in Solr to run a query across multiple Solr indexes'.

What about the 'shards' parameter?  That allows searching across multiple cores 
in the same instance, or shards across multiple instances.

There are certainly implications here (like Relevance not being consistent 
across cores / shards), but it works pretty well for us...

Thanks!

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com




-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Wednesday, February 16, 2011 4:09 PM
To: solr-user@lucene.apache.org
Cc: Thumuluri, Sai
Subject: Re: Solr multi cores or not

Solr multi-core essentially just lets you run multiple separate
distinct
Solr indexes in the same running Solr instance.

It does NOT let you run queries across multiple cores at once. The
cores are just like completely separate Solr indexes, they are just
conveniently running in the same Solr instance. (Which can be easier
and
more compact to set up than actually setting up separate Solr
instances.
And they can share some config more easily. And it _may_ have
implications on JVM usage, not sure).

There is no good way in Solr to run a query across multiple Solr
indexes, whether they are multi-core or single cores in separate Solr
doesn't matter.

Your first approach should be to try and put all the data in one Solr
index. (one Solr 'core').

Jonathan

On 2/16/2011 3:45 PM, Thumuluri, Sai wrote:

Hi,

I have a need to index multiple applications using Solr, I also have

the

need to share indexes or run a search query across these application
indexes. Is solr multi-core - the way to go?  My server config is
2virtual CPUs @ 1.8 GHz and has about 32GB of memory. What is the
recommendation?

Thanks,
Sai Thumuluri








Re: Multicore boosting to only 1 core

2011-02-15 Thread Jonathan Rochkind
No. In fact, there's no way to search over multiple cores at once in Solr 
at all, even before you get to your boosting question. Your different 
cores are entirely different Solr indexes; Solr has no built-in way to 
combine searches across multiple Solr instances.


[Well, sort of it can, with sharding. But sharding is unlikely to be a 
solution to your problem either, UNLESS your problem is that your solr 
index is so big you want to split it across multiple machines for 
performance.  That is the problem sharding is meant to solve. People 
trying to use it to solve other problems run into trouble.]



On 2/14/2011 1:59 PM, Tanner Postert wrote:

I have a multicore system and I am looking to boost results by date, but
only for 1 core. Is this at all possible?

Basically one of the core's content is very new, and changes all the time,
and if I boost everything by date, that core's content will almost always be
at the top of the results, so I only want to do the date boosting to the
cores that have older content so that their more recent results get boosted
over the older content.


Re: schema.xml configuration for file names?

2011-02-15 Thread Jonathan Rochkind
You can't just send arbitrary XML to Solr for update, no.  You need to 
send a Solr Update Request in XML. You can write software that 
transforms that arbitrary XML to a Solr update request; for simple cases 
it could even just be XSLT.  There are also a variety of other mediator 
pieces that come with Solr for doing updates; you can send updates in 
comma-separated-value format, or you can use the DataImportHandler to, 
in some not-too-complicated cases, embed the translation from your 
arbitrary XML to Solr documents in your Solr instance itself.


But you can't just send arbitrary XML to the Solr update handler, no.

No matter what method you use to send documents to Solr, you're going to 
have to think about what you want your Solr schema to look like -- what 
fields of what types -- and then map your data to it.  In Solr, unlike in 
an rdbms, what you want your schema to look like has a lot to do with 
what kinds of queries you will want it to support; it can't just be done 
based on the nature of the data alone.


Jonathan

On 2/15/2011 12:45 PM, alan bonnemaison wrote:

Erick,

I think you put your finger on the problem. Our XML files (the ones we get from our
suppliers) do *not* look like that.

Here's what a typical file looks like:

<insert_list>...<result><result outcome="PASS"></result><parameter_list>
<string_parameter name="SN" value="NOVAL" /><string_parameter name="RECEIVER" value="000907010391" />
<string_parameter name="Model" value="R16-500" />...<string_parameter name="WorkCenterID" value="PREP" />
<string_parameter name="SiteID" value="CTCA" /><string_parameter name="RouteID" value="ADV" />
<string_parameter name="LineID" value="Line5" /></parameter_list><config enable_sfcs_comm="true"
enable_param_db_comm="false" force_param_db_update="false" driver_platform="LABVIEW" mode="PROD"
driver_revision="2.0"></config></insert_list>

Obviously, nothing like <add><doc></doc></add>

By the way, querying q=*:* returned HTTP error 500 (null pointer
exception), which leads me to believe that my index is 100% empty.

What I am trying to do cannot be done, correct? I just don't want to waste
anyone's time.

Thanks,

Alan.


On Tue, Feb 15, 2011 at 6:01 AM, Erick Ericksonerickerick...@gmail.comwrote:


Can we see a small sample of an XML file you're posting? Because it should
look something like:

<add>
  <doc>
    <field name="stbmodel">R16-500</field>
    ... more fields here ...
  </doc>
</add>

Take a look at the Solr admin page after you've indexed data to see what's
actually in your index; I suspect what's in there isn't what you
expect.

Try querying q=*:* just for yucks to see what the documents returned look
like.

I suspect your index doesn't contain anything like what you think, but
that's only
a guess...

Best
Erick

On Mon, Feb 14, 2011 at 7:15 PM, alan bonnemaisonkg6...@gmail.com
wrote:

Hello!

We receive hardware manufacturing data from our suppliers in XML files. On a
typical day, we get 25,000 files. That is why I chose to implement Solr.

The file names are made of eleven fields separated by tildes, like so:

CTCA~PRE~PREP~1010123~ONTDTVP5A~41~P~R16-500~000912239878~20110125~212321.XML

Our R&D guys want to be able to search each field of the XML file names
(OR operation) but they don't care to search the file contents. Ideally,
they would like to do a query like all files where stbmodel is equal to R16-500
or result is P or filedate is 20110125...you get the idea.

I defined each data field in schema.xml like so (from left to right --
sorry for the long list):

   <field name="location"     type="textgen" indexed="false" stored="true"  multiValued="false"/>
   <field name="scriptid"     type="textgen" indexed="false" stored="true"  multiValued="false"/>
   <field name="slotid"       type="textgen" indexed="false" stored="true"  multiValued="false"/>
   <field name="workcenter"   type="textgen" indexed="false" stored="false" multiValued="false"/>
   <field name="workcenterid" type="textgen" indexed="false" stored="false" multiValued="false"/>
   <field name="result"       type="string"  indexed="true"  stored="true"  multiValued="false"/>
   <field name="computerid"   type="textgen" indexed="false" stored="true"  multiValued="false"/>
   <field name="stbmodel"     type="textgen" indexed="true"  stored="true"  multiValued="false"/>
   <field name="receiver"     type="string"  indexed="true"  stored="true"  multiValued="false"/>
   <field name="filedate"     type="textgen" indexed="false" stored="true"  multiValued="false"/>
   <field name="filetime"     type="textgen" indexed="false" stored="true"  multiValued="false"/>

Also, I defined the field receiver as the unique key. But no results are
returned by my queries. I made sure to update my index like so:

java -jar apache-solr-1.4.1/example/exampledocs/post.jar *XML

I am obviously missing something. Is there a way to configure schema.xml
to search for file names? I welcome your input.

Al.






RE: Concurrent updates/commits

2011-02-09 Thread Jonathan Rochkind
Solr does handle concurrency fine. But there is NOT transaction isolation 
like you'll get from an rdbms. All 'pending' changes are (conceptually, anyway) 
held in a single queue, and any commit will commit ALL of them. There aren't 
going to be any data corruption issues or anything from concurrent adds (unless 
there's a bug in Solr, and there isn't supposed to be) -- but there is no kind of 
transaction or isolation between different concurrent adders. So, sure, 
everyone can add concurrently -- but any time any of those actors issues a 
commit, all pending adds are committed. 

In addition, there are problems with Solr's basic architecture and _too 
frequent_ commits (whether made by different processes or not doesn't 
matter). When a new commit happens, Solr fires up a new index searcher and 
warms it up on the new version of the index. Until the new index searcher is 
fully warmed, the old index searcher is still serving queries.  Which can also 
mean that there are, for this period, TWO versions of all your caches in RAM 
and such. So let's say it takes 5 minutes for the new index to be fully warmed, 
but you have commits happening every 1 minute -- then you'll end up with 
FIVE 'new indexes' being warmed, meaning potentially 5 times the RAM usage 
(quickly running into a JVM out-of-memory error) and lots of CPU activity going on 
warming indexes that will never actually be used (because even though they 
aren't even done being warmed and ready to use, they've already been superseded 
by a later commit).   

I don't know of any good way to deal with this except less frequent commits. 
One way to get less frequent commits is to use Solr replication, and 'stage' 
all your commits in a 'master' index, but only replicate to 'slave' at a 
frequency slow enough so the new index is fully warmed before the next commit 
happens. 
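
For reference, the knobs involved live in solrconfig.xml, roughly like this (the
master URL and poll interval here are made up; pick an interval longer than your
warm-up time):

  <maxWarmingSearchers>2</maxWarmingSearchers>

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:10:00</str>
    </lst>
  </requestHandler>

maxWarmingSearchers caps how many overlapping warming searchers Solr will allow,
and on a slave the pollInterval is effectively how often a new commit can arrive.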

Some new features in trunk (both lucene and solr) for 'near real time'  search 
ameliorate this problem somewhat, depending on the nature of your commits. 

Jonathan

From: Savvas-Andreas Moysidis [savvas.andreas.moysi...@googlemail.com]
Sent: Wednesday, February 09, 2011 10:34 AM
To: solr-user@lucene.apache.org
Subject: Concurrent updates/commits

Hello,

This topic has probably been covered before here, but we're still not very
clear about how multiple commits work in Solr.
We currently have a requirement to make our domain objects searchable
immediately after the get updated in the database by some user action. This
could potentially cause multiple updates/commits to be fired to Solr and we
are trying to investigate how Solr handles those multiple requests.

This thread:
http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search

suggests that Solr will handle all of the lower level details, and that "Before
a *COMMIT* is done, lock is obtained and its released after the
operation",
which in my understanding means that Solr will serialise all update/commit
requests?

However, the Solr book, in the Commit, Optimise, Rollback section, reads:
"if more than one Solr client were to submit modifications and commit them
at similar times, it is possible for part of one client's set of changes to
be committed before that client told Solr to commit",
which suggests that requests are *not* serialised.

Our questions are:
- Does Solr handle concurrent requests or do we need to add synchronisation
logic around our code?
- If Solr *does* handle concurrent requests, does it serialise each request
or has some other strategy for processing those?


Thanks,
- Savvas


RE: relational db mapping for advanced search

2011-02-08 Thread Jonathan Rochkind
I have no great answer for you; this is, to me, a generally unanswered question. 
It's hard to do this sort of thing with Solr, and I think you understand the 
situation properly. 

There ARE some interesting new features in trunk (not 1.4) that may be 
relevant, although from my perspective none of them provide magic-bullet 
solutions. But there is a 'join' feature which could be awfully useful with the 
setup you suggest of having different 'types' of documents all together in the 
same index. 

https://issues.apache.org/jira/browse/SOLR-2272
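
To give a flavour of it, with that patch a query run against 'occupation' documents
that hands back the matching 'person' documents looks roughly like this (the field
names are made up, and the syntax may still change in trunk):

  q={!join from=person_id to=id}type:occupation AND place:Australia AND start_date:[1950-01-01T00:00:00Z TO 1960-12-31T23:59:59Z]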

From: Scott Yeadon [scott.yea...@anu.edu.au]
Sent: Tuesday, February 08, 2011 4:41 PM
To: solr-user@lucene.apache.org
Subject: relational db mapping for advanced search

Hi,

I was just after some advice on how to map some relational metadata to a
solr index. The web application I'm working on is based around people
and the searching based around properties of these people. Several
properties are more complex - for example, a person's occupations have
place, from/to dates and other descriptive text; texts about a person
have authors, sources and publication dates. Despite the usefulness of
facets and the search-based navigation, an advanced search feature is a
non-negotiable required feature of the application.

An advanced search needs to be able to query a person on any set of
attributes (e.g. gender, birth date, death date, place of birth) etc
including the more complex search criteron as described above
(occupation, texts). Taking occupation as an example, because occupation
has its own metadata and a person could have worked an arbitrary number
of occupations throughout their lifetime, I was wondering how/if this
information can be denormalised into a single person index document to
support such a search. I can't use text concatenation in a multivalued
field as I need to be able to run date-based range queries (e.g.
publication dates, occupation dates). And I'm not sure that resorting to
multiple repeated fields based on the current limits (e.g. occ1,
occ1startdate, occ1enddate, occ1place, occ2, etc) is a good approach
(although that would work).

If there isn't a sensible way to denormalise this, what is the best
approach? For example, should I have an occupation document type, a
person document type and a text/source document type, each containing the
relevant person id, and then (in the advanced search context) run a query
against each document type and use the intersecting set of person ids as
the result used by the application for its display/pagination? And if so,
how do I ensure I capture all records? For example, if there are 100,000
hits on someone having worked in Australia in 1956, is there any way to
ensure all 100,000 are returned in a query (similar to facet.limit = -1),
other than specifying an arbitrarily high number in the rows parameter and
hoping a query doesn't hit more than 100,000 and thus exclude those
above the limit from the intersect processing?

Or is there a single query solution?

Any advice/hints welcome.

Scott.


RE: prices

2011-02-04 Thread Jonathan Rochkind
Your prices are just dollars and cents? For actual queries, you might consider 
an int type rather than a float type.  Multiply by a hundred to put a price in the 
index, and multiply the values in your queries by a hundred before putting them in 
the query.  Same for range faceting; just divide by 100 before displaying 
anything you get back. 
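
A sketch of what I mean, assuming the tint type from the example schema:

  <field name="price_cents" type="tint" indexed="true" stored="true"/>

$12.50 gets indexed as 1250, and a $10-to-$25 range query becomes
fq=price_cents:[1000 TO 2500].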

Fixed-precision values like prices aren't really floats and don't really 
need floats, and floats sometimes do weird things, as you've noticed. 

Alternatively, if your problem is simply that you want to display 2.0 as 2.00 
rather than 2 or 2.0, that is something for you to take care of in your PHP 
app that does the display. PHP will have some function for formatting numbers 
and saying with what precision you want to display them. 

There is no way to keep two trailing zeroes 'in' a float field, because 2.0 
or 2. is the same value as 2.00 or 2.000; they've all got the same 
internal representation in the float field. And there is no way I know of to tell Solr 
what precision to render floats with in its responses. 


From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley 
[yo...@lucidimagination.com]
Sent: Friday, February 04, 2011 1:49 PM
To: solr-user@lucene.apache.org
Subject: Re: prices

On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 Using solr 1.4.

 I have a price in my schema. Currently it's a tfloat. Somewhere along the way
 from php, json, solr, and back, extra zeroes are getting truncated along with
 the decimal point for even dollar amounts.

 So I have two questions, neither of which seemed to be findable with google.

 A/ Any way to keep both zeroes going into a float field? (In the analyzer,
 with XML output, the values are shown with 1 zero)
 B/ Can strings be used in range queries like a float and work well for prices?

You could do a copyField into a stored string field and use the tfloat
(or tint and store cents)
for range queries, searching, etc, and the string field just for display.

-Yonik
http://lucidimagination.com





  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
 better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: chaning schema

2011-02-03 Thread Jonathan Rochkind
It could be related to Tomcat.  I've had inconsistent experiences there 
too: I _thought_ I could delete just the contents of the data/ 
directory, but at some point I realized that wasn't working, confusing 
me as to whether I was remembering correctly that deleting just the 
contents ever worked.   At the moment, on my setup, I definitely need to 
delete the whole data/ directory.


At one point I switched my setup from Jetty to Tomcat, but at about the 
same point I switched from single-core to multi-core too. So it 
could be a multi-core thing (which seems somewhat more likely than 
Jetty vs Tomcat making a difference). Or it could be something 
else completely that none of us knows; I just report my limited 
observations from experience. :)


Jonathan

On 2/3/2011 8:17 AM, Erick Erickson wrote:

Erik:

Is this a Tomcat-specific issue? Because I regularly delete just the
data/index directory on my Windows
box running Jetty without any problems. (3_x and trunk)

Mostly want to know because I just encouraged someone to just delete the
index dir based on my
experience...

Thanks
Erick

On Tue, Feb 1, 2011 at 12:24 PM, Erik Hatchererik.hatc...@gmail.comwrote:


the trick is, you have to remove the data/ directory, not just the
data/index subdirectory.  and of course then restart Solr.

or delete *:*?commit=true, depending on what's the best fit for your ops.
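
(Spelled out, that delete-everything request is roughly:

  curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'

adjust host, port and core path to whatever your setup uses.)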

Erik

On Feb 1, 2011, at 11:41 , Dennis Gearon wrote:


I tried removing the index directory once, and Tomcat refused to start up
because it didn't have a segments file.




- Original Message 
From: Erick Ericksonerickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tue, February 1, 2011 5:04:51 AM
Subject: Re: chaning schema

That sounds right. You can cheat and just remove <solr_home>/data/index
rather than delete *:* though (you should probably do that with the Solr
instance stopped).

Make sure to remove the directory index as well.

Best
Erick

On Tue, Feb 1, 2011 at 1:27 AM, Dennis Gearongear...@sbcglobal.net

wrote:

Anyone got a great little script for changing a schema?

i.e., after changing:
database,
the view in the database for data import
the data-config.xml file
the schema.xml file

I BELIEVE that I have to run:
a delete command for the whole index *:*
a full import and optimize

This all sound right?

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually

a

better
idea to learn from others’ mistakes, so you do not have to make them
yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.






Re: OAI on SOLR already done?

2011-02-02 Thread Jonathan Rochkind
The trick is that you can't just have a generic black-box OAI-PMH 
provider on top of any Solr index. How would it know where to get the 
metadata elements it needs, such as title, or last-updated date, etc.? 
Any given Solr index might not even have these in stored fields -- and a 
given app might want to look them up from somewhere other than stored 
fields.


If the Solr index does have them in stored fields, and you do want to 
get them from the stored fields, then it's, I think (famous last words) 
relatively straightforward code to write. A mapping from solr stored 
fields to metadata elements needed for OAI-PMH, and then simply 
outputting the XML template with those filled in.


I am not aware of anyone that has done this in a 
re-useable/configurable-for-your-solr tool. You could possibly do it 
solely using the built-in Solr 
JSP/XSLT/other-templating-stuff-I-am-not-familiar-with stuff, rather 
than as an external Solr client app, or it could be an external Solr 
client app.


This is actually a very similar problem to something someone else asked 
a few days ago Does anyone have an OpenSearch add-on for Solr?  Very 
very similar problem, just with a different XML template for output 
(usually RSS or Atom) instead of OAI-PMH.


On 2/2/2011 3:14 PM, Paul Libbrecht wrote:

Peter,

I'm afraid your service is harvesting, and I am trying to look at a PMH provider 
service.

Your project appeared early in the Google matches.

paul


On 2 Feb 2011, at 20:46, Péter Király wrote:


Hi,

I don't know whether it fits your need, but we are building a tool
based on Drupal (eXtensible Catalog Drupal Toolkit), which can harvest
with OAI-PMH and index the harvested records into Solr. The records are
harvested, processed, and stored into MySQL, then we index them into
Solr. We created some ways to manipulate the original values before
sending them to Solr. We created it in a modular way, so you can change
settings in an admin interface or write your own hooks (special
Drupal functions) to tailor the application to your needs. We support
only Dublin Core and our own FRBR-like schema (called XC schema), but
you can add more schemas. Since this forum is about Solr, and not
applications using Solr, if you are interested in this tool, please write me a
private message, or visit http://eXtensibleCatalog.org, or the
module's page at http://drupal.org/project/xc.

Hope this helps,

Péter
eXtensible Catalog

2011/2/2 Paul Libbrechtp...@hoplahup.net:

Hello list,

I've come across a few Google matches that indicate that Solr-based servers implement 
the Open Archives Initiative's Metadata Harvesting Protocol.

Is there something made to be re-usable that would be an add-on to solr?

thanks in advance

paul




Re: OAI on SOLR already done?

2011-02-02 Thread Jonathan Rochkind

On 2/2/2011 5:19 PM, Dennis Gearon wrote:

Does something like this work to extract dates, phone numbers, addresses across
international formats and languages?

Or, just in the plain ol' USA?


What are you talking about?  There is nothing discussed in this thread 
that does any 'extracting' of dates, phone numbers or addresses at all, 
whether in international or domestic formats.




RE: DismaxParser Query

2011-01-27 Thread Jonathan Rochkind
Yes, I think nested queries are the only way to do that, and yes, nested 
queries like Daniel's example work (I've done it myself).  I haven't really 
tried to get into understanding/demonstrating _exactly_ how the relevance ends 
up working on the overall master query in such a situation, but it sort of 
works. 

(Just note that Daniel's example isn't quite right, I think you need double 
quotes for the nested _query_, just check the wiki page/blog post on nested 
queries). 
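
(For the record, the shape of such a nested query is roughly this -- the qf values
are placeholders, not anything from a real schema:

  q=_query_:"{!dismax qf='field1 field2 field3'}keyword1 keyword2" AND _query_:"{!dismax qf='field4 field5'}keyword3 keyword4"

with the outer query left on the default 'lucene' parser.)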

Does eDismax handle parens for order of operation too?  If so, eDismax is 
probably the best/easiest solution, especially if you're trying to parse an 
incoming query from some OTHER format and translate it to something that can be 
sent to Solr, which is what I often do. 

I haven't messed with eDismax myself yet.  Does anyone know if there's any easy 
(easy!) way to get eDismax in Solr 1.4?  Any easy way to compile an eDismax 
query parser on its own that works with Solr 1.4, and then just drop it into 
your local lib/ for use with an existing Solr 1.4?

Jonathan


From: Daniel Pötzinger [daniel.poetzin...@aoemedia.de]
Sent: Thursday, January 27, 2011 9:26 AM
To: solr-user@lucene.apache.org
Subject: AW: DismaxParser Query

It may also be an option to mix the query parsers?
Something like this (not tested):

q={!lucene}field1:test OR field2:test2 _query_:{!dismax qf=fields}+my dismax 
-bad

So you have the benefits of lucene and dismax parser

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, 27 January 2011 15:15
To: solr-user@lucene.apache.org
Subject: Re: DismaxParser Query

What version of Solr are you using, and could you consider either 3x or
applying a patch to 1.4.1? Because eDismax (extended dismax) handles the
full Lucene query language and probably works here. See the Solr
JIRA 1553 at https://issues.apache.org/jira/browse/SOLR-1553

Best
Erick

On Thu, Jan 27, 2011 at 8:32 AM, Isan Fulia isan.fu...@germinait.comwrote:

 It worked by making mm=0 (it acted as OR operator)
 but how to handle this

 field1:((keyword1 AND keyword2) OR (keyword3 AND keyword4)) OR
 field2:((keyword1 AND keyword2) OR (keyword3 AND keyword4)) OR
 field3:((keyword1 AND keyword2) OR (keyword3 AND keyword4))




 On 27 January 2011 17:06, lee carroll lee.a.carr...@googlemail.com
 wrote:

  sorry ignore that - we are on dismax here - look at mm param in the docs
  you can set this to achieve what you need
 
  On 27 January 2011 11:34, lee carroll lee.a.carr...@googlemail.com
  wrote:
 
    the default operation can be set in your config to be OR, or on the
   query with something like q.op=OR
  
  
  
   On 27 January 2011 11:26, Isan Fulia isan.fu...@germinait.com wrote:
  
   but q=keyword1 keyword2  does AND operation  not OR
  
   On 27 January 2011 16:22, lee carroll lee.a.carr...@googlemail.com
   wrote:
  
use dismax q for the first three fields and a filter query for the 4th and
5th fields, so:
q=keyword1 keyword2
qf=field1 field2 field3
pf=field1 field2 field3
mm=something sensible for you
defType=dismax
fq=field4:(keyword3 OR keyword4) AND field5:(keyword5)
   
take a look at the dismax docs for extra params
   
   
   
On 27 January 2011 08:52, Isan Fulia isan.fu...@germinait.com
  wrote:
   
 Hi all,
 The query for standard request handler is as follows
 field1:(keyword1 OR keyword2) OR field2:(keyword1 OR keyword2) OR
 field3:(keyword1 OR keyword2) AND field4:(keyword3 OR keyword4)
 AND
 field5:(keyword5)


 How the same above query can be written for dismax request handler

 --
 Thanks  Regards,
 Isan Fulia.

   
  
  
  
   --
   Thanks  Regards,
   Isan Fulia.
  
  
  
 



 --
 Thanks  Regards,
 Isan Fulia.



Re: How to edit / compile the SOLR source code

2011-01-26 Thread Jonathan Rochkind
[Btw, this is great, thank you so much to Solr devs for providing simple 
ant-based compilation, and not making me install specific development 
tools and/or figure out how to use maven to compile, like certain other 
java projects. Just make sure ant is installed and 'ant dist', I can do 
that!  I more or less know how to write Java, at least for simple 
things,  but I still have trouble getting the right brew of required 
Java dev tools working properly to compile some projects! ]


On 1/26/2011 4:19 PM, Erick Erickson wrote:

Sure, at the top level (above src) you should be able to just type
ant dist, then look in the dist directory and there should be a
solr<version>.war

Best
Erick

On Wed, Jan 26, 2011 at 11:43 AM, Anuraganurag.it.jo...@gmail.com  wrote:


Actually I also want to edit the source files of Solr. Does that mean I will
have to go into the src directory of Solr and then rebuild using ant? Do I need
to compile them myself, or will ant do the whole compiling as well as updating
the jar files?
I have the following files in the Solr-1.3.0 directory:

/home/anurag/apache-solr-1.3.0/build
/home/anurag/apache-solr-1.3.0/client
/home/anurag/apache-solr-1.3.0/contrib
/home/anurag/apache-solr-1.3.0/dist
/home/anurag/apache-solr-1.3.0/docs
/home/anurag/apache-solr-1.3.0/example
/home/anurag/apache-solr-1.3.0/lib
/home/anurag/apache-solr-1.3.0/src
/home/anurag/apache-solr-1.3.0/build.xml
/home/anurag/apache-solr-1.3.0/CHANGES.txt
/home/anurag/apache-solr-1.3.0/common-build.xml
/home/anurag/apache-solr-1.3.0/KEYS.txt
/home/anurag/apache-solr-1.3.0/LICENSE.txt
/home/anurag/apache-solr-1.3.0/NOTICE.txt
/home/anurag/apache-solr-1.3.0/README.txt

and I want to edit the source code to implement my changes. How should I
proceed?

-
Kumar Anurag

--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-edit-compile-the-SOLR-source-code-tp477584p2355270.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: in-index representaton of tokens

2011-01-25 Thread Jonathan Rochkind

Why does it matter?  You can't really get at them unless you store them.

I don't know what "table per column" means; there's nothing in Solr 
architecture called a "table" or a "column". Although by "column" you 
probably mean, more or less, a Solr field.  There is nothing like a 
"table" in Solr.


Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:

So, the index is a list of tokens per column, right?

There's a table per column that lists the analyzed tokens?

And the tokens per column are represented as what, system integers? 32/64 bit
unsigned ints?

  Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
I haven't figured out any way to achieve that AT ALL without making a 
separate Solr index just to serve autosuggest queries. At least when you 
want to auto-suggest on a multi-valued field. Someone posted a crazy 
tricky way to do it with a single-valued field a while ago.  If you 
can/are willing to make a separate Solr index with a schema set up for 
auto-suggest specifically, it's easy. But from an existing schema, where 
you want to auto-suggest just based on the values in one field, it's a 
multi-valued field, and you want to allow matches in the middle of the 
field -- I don't think there's a way to do it.


On 1/25/2011 3:03 PM, johnnyisrael wrote:

Hi Eric,

What I want here is: let's say I have 3 documents like

[pineapple vers apple, milk with apple, apple milk shake]

and if I search for apple, it should return only apple milk shake,
because that term alone starts with the word apple which I typed in. It
should not bring back the others, and if I type milk it should return only milk
with apple.

I want output similar to Google auto-suggest.

Is there a way to achieve this without encapsulating with double quotes?

Thanks,

Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
Ah, sorry, I got confused about your requirements. If you just want to 
match at the beginning of the field, it may be more possible, using 
edgegrams or a wildcard, if you have a single-valued field. Do you have a 
single-valued or a multi-valued field?  That is, does each document have 
just one value, or multiple?   I still get confused about how to do it 
with edgegrams, even with a single-valued field, but I think maybe it's 
possible.


_Definitely_ possible, with or without edgegrams, if you are 
willing/able to make a completely separate Solr index where each term 
for auto-suggest is a document.  Yes.


The problem lies in what "results" are. In general, Solr's results are 
the documents you have in the Solr index. Thus it makes everything a lot 
easier to deal with if you have an index where each document in the 
index is a term for auto-suggest.   But that doesn't always meet 
requirements if you need to auto-suggest within existing fq's and such, 
and of course it takes more resources to run an additional Solr index.


On 1/25/2011 5:03 PM, mesenthil wrote:

The index contains around 1.5 million documents. As this is used for
autosuggest feature, performance is an important factor.

So it looks like, using edgeNgram, it is difficult to achieve the
following:

Results should return only those terms where the search letter matches
the first word. For example, when we type M, it should return
Mumford and Sons and not Jackson Michael.


Jonathan,

Is it possible to achieve this when we have a separate index using edgeNgram?



Re: Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Jonathan Rochkind

With the 'lucene' query parser?

Include q.op=OR and then put a + (mandatory) in front of every term 
in the 'q' that is NOT optional; the rest will be optional.  I think 
that will do what you want.
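
In this thread's terms that would be something like:

  q=+content:lorem content:optionalboostword^10&q.op=OR

The +content:lorem part must match; the boosted term is optional but lifts the
score when it does match.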


Jonathan

On 1/25/2011 5:07 PM, Daniel Pötzinger wrote:

Hi

I am searching for a way to specify optional terms in a query (terms that don't need 
to match, but if they do match should influence the scoring).

Using the dismax parser, a query like this:
<str name="mm">2</str>
<str name="debugQuery">on</str>
<str name="q">+lorem ipsum dolor amet</str>
<str name="qf">content</str>
<str name="hl.fl"/>
<str name="qt">dismax</str>
will be parsed into something like this:
<str name="parsedquery_toString">
+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()
</str>
which will result in only 2 of the 3 optional terms needing to match?


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term should match but another is 
optional. But if the optional part matches - it should give the document an 
extra score.
Something like :-)
<str name="q">content:lorem #optional#content:optionalboostword^10</str>

An idea would be to use a function query to boost the document:
<str name="q">
content:lorem _val_:query({!lucene v='optionalword^20'})
</str>
Which will result in:
<str name="parsedquery_toString">
+content:forum +query(content:optionalword^20.0,def=0.0)
</str>
Is this a good way or are there other suggestions?

Thanks for any opinion and tips on this

Daniel




RE: in-index representaton of tokens

2011-01-25 Thread Jonathan Rochkind
There aren't any tables involved. There's basically one list (per field) of 
unique tokens for the entire index, and also, a list for each token of which 
documents contain that token. Which is efficiently encoded, but I don't know 
the details of that encoding, maybe someone who does can tell you, or you can 
look at the lucene source, or get one of the several good books on lucene.  
These 'lists' are set up so you can efficiently look up a token, and see what 
documents contain that token.  That's basically what lucene does, the purpose 
of lucene. Oh, and then there's term positions and such too, so not only can 
you see what documents contain that token but you can do proximity searches and 
stuff. 

This all gets into lucene implementation details I am not familiar with though. 
 

Why do you want to know?  If you have specific concerns about disk space or RAM 
usage or something and how different schema choices affect it, ask them, and 
someone can probably tell you more easily than someone can explain the total 
architecture of Lucene in a short listserv message. But, hey, maybe someone 
other than me can do that too!

From: Dennis Gearon [gear...@sbcglobal.net]
Sent: Tuesday, January 25, 2011 7:02 PM
To: solr-user@lucene.apache.org
Subject: Re: in-index representaton of tokens

I am saying there is a list of tokens that have been parsed (a table of them)
for each column? Or one for the whole index?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Tue, January 25, 2011 9:29:36 AM
Subject: Re: in-index representaton of tokens

Why does it matter?  You can't really get at them unless you store them.

I don't know what "table per column" means; there's nothing in Solr
architecture called a "table" or a "column". Although by "column" you
probably mean, more or less, a Solr field.  There is nothing like a
"table" in Solr.

Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:
 So, the index is a list of tokens per column, right?

 There's a table per column that lists the analyzed tokens?

 And the tokens per column are represented as what, system integers? 32/64 bit
 unsigned ints?

   Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: Taxonomy in SOLR

2011-01-24 Thread Jonathan Rochkind
There aren't any great general purpose out of the box ways to handle 
hieararchical data in Solr.  Solr isn't an rdbms.


There may be some particular advice on how to set up a particular Solr 
index to answer particular questions with regard to hieararchical data.


I saw a great point made recently comparing rdbms to NoSQL stores, which 
applied to Solr too even though Solr is NOT a noSQL store.  In rdbms, 
you set up your schema thinking only about your _data_, and modelling 
your data as flexibly as possible. Then once you've done that, you can 
ask pretty much any well-specified question you want of your data, and 
get a correct and reasonably performant answer.


In Solr, on the other hand, we set up our schemas to answer particular 
questions. You have to first figure out what kinds of questions you will 
want to ask Solr, what kinds of queries you'll want to make, and then 
you can figure out how to structure your data to ask those questions.  
Some questions are actually very hard to set up Solr to answer -- in 
general Solr is about setting up your data so whatever question you have 
can be reduced to asking "is token X in field Y".


This can be especially tricky in cases where you want to use a single 
Solr index to answer multiple questions, where the questions are such 
that you really need to set up your data _differently_ to get Solr to 
optimally answer each question.


Solr is not a general purpose store like an rdbms, where you can set up 
your schema once in terms of your data and use it to answer nearly any 
conceivable well-specified question after that.  Instead, Solr does 
things that rdbms can't do quickly or can't do at all.  But you lose 
some things too.


On 1/24/2011 3:03 AM, Damien Fontaine wrote:

Hi,

I am trying Solr and I have one question. In the schema that I set up,
there are 10 fields that always contain the same data (hierarchical taxonomies),
but with 4 million documents, disk space and indexing time must be big. I need
these fields for auto-complete. Is there another way to do this type of operation?

Damien



RE: filter update by IP

2011-01-23 Thread Jonathan Rochkind
My favorite other external firewall'ish technology  is just an apache 
front-end reverse proxying to the Java servlet (such as Solr), with access 
controls in apache. 
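
A rough sketch of that kind of apache config (Apache 2.2 syntax; the paths and the
allowed address are made up):

  ProxyPass /solr http://localhost:8983/solr
  ProxyPassReverse /solr http://localhost:8983/solr

  <Location /solr/update>
    Order deny,allow
    Deny from all
    Allow from 10.0.0.5
  </Location>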

I haven't actually done it with Solr myself though; my Solr is behind a 
firewall, accessed by trusted apps only. Be careful making your Solr viewable to 
the world, even behind an 'other external firewall'ish technology'.  There are 
several features in Solr you do NOT want to expose to the world: the ability to 
change the index in general, of which there are a variety of ways to do it in 
addition to the /update/csv handler, like the straight /update handler. Also 
consider the replication commands -- the example Solr solrconfig.xml, at least, 
will allow an HTTP request that tells Solr to replicate from an arbitrarily 
specified 'master', definitely not something you'd want open to the world 
either!  There may be other examples too that you might not think of at first.  

My impression is that Solr is written assuming it will be safely ensconced 
behind a firewall and accessed by trusted applications only.  If you're not 
going to do this, you're going to have to be careful to make sure to lock down 
or remove a lot of things, /update/csv is just barely a start.  I don't know if 
anyone has analyzed and written up secure ways to do this -- it sounds like 
there would be interest for such since it keeps coming up on the list. 

Kind of personally curious _why_ it keeps coming up on the list so much. Is 
everyone trying to go into business vending Solr in the cloud to customers who 
will write their own apps, or are there some other less obvious (to me) use 
cases?


From: Erik Hatcher [erik.hatc...@gmail.com]
Sent: Sunday, January 23, 2011 1:47 PM
To: solr-user@lucene.apache.org
Subject: Re: filter update by IP

No.  SolrQueryRequest doesn't (currently) have access to the actual HTTP 
request coming in.  You'll need to do this either with a servlet filter and 
register it into web.xml or restrict it from some other external firewall'ish 
technology.

Erik

On Jan 23, 2011, at 13:21 , Teebo wrote:

 Hi

 I would like to restrict access to /update/csv request handler

 Is there a ready to use UpdateRequestProcessor for that ?


 My first idea was to inherit from CSVRequestHandler and to override
 public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
  ...
  // restrict-by-IP code here
  ...
  super.handleRequest(req, rsp);
 }

 What do you think ?

 Regards,
 t.



RE: api key filtering

2011-01-22 Thread Jonathan Rochkind
If you COULD solve your problem by indexing 'public', or other tokens from a 
limited vocabulary of document roles, in a field -- then I'd definitely suggest 
you look into doing that, rather than doing odd things with Solr instead. If 
the only barrier is not currently having sufficient logic at the indexing stage 
to do that, then it is going to end up being a lot less of a headache in the 
long term to simply add a layer at the indexing stage to add that in, than 
trying to get Solr to do things outside of its, well, 'comfort zone'. 

Of course, depending on your requirements, it might not be possible to do that, 
maybe you can't express the semantics in terms of a limited set of roles 
applied to documents. And then maybe your best option really is sending an up 
to 2k element list (not exactly the same list every time, presumably) of 
acceptable documents to Solr with every query, and maybe you can get that to 
work reasonably.  Depending on how many different complete lists of documents 
you have, maybe there's a way to use Solr caches effectively in that situation, 
or maybe that's not even necessary since lookup by unique id should be pretty 
quick anyway, not really sure. 

But if the semantics are possible, much better to work with Solr rather than 
against it, it's going to take a lot less tinkering to get Solr to perform well 
if you can just send an fq=role:public or something, instead of a list of 
document IDs.  You won't need to worry about it, it'll just work, because you 
know you're having Solr do what it's built to do. Totally worth a bit of work 
to add a logic layer at the indexing stage. IMO. 

From: Erick Erickson [erickerick...@gmail.com]
Sent: Saturday, January 22, 2011 4:50 PM
To: solr-user@lucene.apache.org
Subject: Re: api key filtering

1024 is the default number, it can be increased. See MaxBooleanClauses
in solrconfig.xml

This shouldn't be a problem with 2K clauses, but expanding it to tens of
thousands is probably a mistake (but test to be sure).

Best
Erick

On Sat, Jan 22, 2011 at 3:50 PM, Matt Mitchell goodie...@gmail.com wrote:

 Hey thanks I'll definitely have a read. The only problem with this though,
 is that our api is a thin layer of app-code, with solr only (no db), we
 index data from our sql db into solr, and push the index off for
 consumption.

 The only other idea I had was to send a list of the allowed document ids
 along with every solr query, but then I'm sure I'd run into a filter query
 limit. Each key could be associated with up to 2k documents, so that's 2k
 values in an fq which would probably be too many for lucene (I think its
 limit 1024).

 Matt

 On Sat, Jan 22, 2011 at 3:40 PM, Dennis Gearon gear...@sbcglobal.net
 wrote:

  The only way that you would have that many api keys per record, is if one
  of
  them represented 'public', right? 'public' is a ROLE. Your answer is to
 use
  RBAC
  style techniques.
 
 
  Here are some links that I have on the subject. What I'm thinking of
 doing
  is:
  Sorry for formatting, Firefox is freaking out. I cut and pasted these
 from
  an
  email from my sent box. I hope the links came out.
 
 
  Part 1
 
 
 
 http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/
 
 
  Part2
  Role-based access control in SQL, part 2 at Xaprb
 
 
 
 
 
  ACL/RBAC Bookmarks ALL
 
  UserRbac - symfony - Trac
  A Role-Based Access Control (RBAC) system for PHP
  Appendix C: Task-Field Access
  Role-based access control in SQL, part 2 at Xaprb
  PHP Access Control - PHP5 CMS Framework Development | PHP Zone
  Linux file and directory permissions
  MySQL :: MySQL 5.0 Reference Manual :: C.5.4.1 How to Reset the Root
  Password
  per RECORD/Entity permissions? - symfony users | Google Groups
  Special Topics: Authentication and Authorization | The Definitive Guide
 to
  Yii |
  Yii Framework
 
  att.net Mail (gear...@sbcglobal.net)
  Solr - User - Modelling Access Control
  PHP Generic Access Control Lists
  Row-level Model Access Control for CakePHP « some flot, some jet
  Row-level Model Access Control for CakePHP « some flot, some jet
  Yahoo! GeoCities: Get a web site with easy-to-use site building tools.
  Class that acts as a client to a JSON service : JSON « GWT « Java
  Juozas Kaziukėnas devBlog
  Re: [symfony-users] Implementing an existing ACL API in symfony
  php - CakePHP ACL Database Setup: ARO / ACO structure? - Stack Overflow
  W3C ACL System
  makeAclTables.sql
  SchemaWeb - Classes And Properties - ACL Schema
  Reardon's Ruminations: Spring Security ACL Schema for Oracle
  trunk/modules/auth/libraries/Khacl.php | Source/SVN | Assembla
  Acl.php - kohana-mptt - Project Hosting on Google Code
  Asynchronous JavaScript Technology and XML (Ajax) With the Java Platform
  The page cannot be found
 
 
   Dennis Gearon
 
 
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
  better
  idea 

Re: Which QueryParser to use

2011-01-20 Thread Jonathan Rochkind

On 1/20/2011 1:42 AM, kun xiong wrote:

That example string means our query is a BooleanQuery containing
BooleanQuerys.

I am wondering how to write a complicated BooleanQuery for dismax, like (A
or B or C) and (D or E)

Or I have to use Lucene query parser.


You can't do it with dismax. You might be able to do it with edismax, 
which is in Solr trunk/4.0 or as a patch to 1.4.


You can also do it, in 1.4,  with nested queries with dismax queries 
nested in a 'lucene' query.


But why would you want to? What do you actually want to do?  The dismax 
parser is great for taking user-entered queries and just sending them 
straight to Solr. Is that why you're interested in it?  It's also a 
convenient way to search a query over multiple fields with different 
boosts in different fields, or with other useful boosts like phrase 
boosts and such. Is that why you're interested in it?  Or something 
else?  Depending on what you want from it, the easiest solution may be 
different. Or if you don't want _anything_ from it, and are happy with a 
straight lucene-style query, then there's no reason to do use it, just 
use the straight 'lucene' query parser, no problem.


Re: Showing facet values in alphabetical order

2011-01-20 Thread Jonathan Rochkind

Are you showing the facets with facet parameters in your request?

Then you can ask for the facets to be returned sorted by byte-order with 
facet.sort=index.


Got nothing to do with your schema, let alone your DIH import 
configuration that you showed us. Just a matter of how you ask Solr for 
facets.


Byte order is not necessarily exactly 'alphabetical' order, if your 
facets are not 7-bit ASCII and/or if they contain punctuation. If your 
facet values are just 7-bit ASCII characters and spaces, it should 
basically be alphabetical order.


But that's all that Solr offers, as far as I know.

On 1/20/2011 12:34 PM, PeterKerk wrote:

I want to provide a list of facets to my visitors ordered alphabetically; for
example, for the 'features' facet I have:

data-config.xml:
<entity name="location_feature" query="select featureid from location_features where locationid='${location.id}'">
  <entity name="feature" query="select title from features where id = '${location_feature.featureid}' ORDER BY title ASC">
    <field name="features" column="title" />
  </entity>
</entity>

schema.xml:
<field name="features" type="textTight" indexed="true" stored="true" multiValued="true"/>
<field name="features_raw" type="string" indexed="true" stored="true" multiValued="true"/>
<copyField source="features" dest="features_raw"/>


But this doesn't give me the facets in alphabetical order.

Besides the features facet, I also have some other facets that ALSO need to
be shown in alphabetical order. How to approach this?


Re: Adding weightage to the facets count

2011-01-20 Thread Jonathan Rochkind
Maybe?: Just keep the 'weightages' in an external store of some kind 
(rdbms, nosql like mongodb, just a straight text config file that your 
app loads into a hash internally, whatever), rather than Solr, and have 
your app look them up for each facet value to be displayed, after your 
app fetches the facet values from Solr.  There's no need to use Solr for 
this, although there might be various tricky ways to do so if you really 
wanted to, there's no perfectly straightforward way.


On 1/20/2011 12:39 PM, sivaprasad wrote:

Hi,

I am building a tag cloud for products by using facets. I made tag names into
facets and I am taking the facet count as the basis for displaying the tag cloud.
Each product has tags with their own weightage. Let us say,

for example:

prod1 has a tag called “Light Weight” with weightage 20,
prod2 has a tag called “Light Weight” with weightage 100.

If I get the facet for “Light Weight”, I will get Light Weight (2);
here I need to take the weightage into account, and the result should be
Light Weight (120).

How can we achieve this? Any ideas are really helpful.

Regards,
Siva



Re: Indexing all permutations of words from the input

2011-01-20 Thread Jonathan Rochkind
Why do you want to do this, what is it meant to accomplish?  There might 
be a better way to accomplish what it is you are trying to do; I can't 
think of anything (which doesn't mean it doesn't exist) that what you're 
actually trying to do would be required in order to do.  What sorts of 
queries do you intend to serve with this setup?


I don't believe there is any analyzer that will do exactly what you've 
specified, included with Solr out of the box. You could definitely write 
your own analyzer in Java to do it. But I still suspect you may not 
actually need to construct your index like that to accomplish whatever 
you are trying to accomplish.


The only point I can think of to caring what words are next to what 
other words is phrase and proximity searches. However, with what 
you've specified, phrase and proximity searches wouldn't be at all 
useful anyway, as EVERY word would be next to every other word, so any 
phrase or proximity search including any words present at all would 
match -- you might as well not do a phrase or proximity search at all, in 
which case it should not matter what order or how close together the 
words are in the index.   Why not just use an ordinary Whitespace 
Tokenizer, and just do ordinary dismax or lucene queries without using 
phrases or proximity?


On 1/20/2011 4:03 PM, Martin Jansen wrote:

Hey there,

I'm looking for an <analyzer> configuration for Solr 1.4 that
accomplishes the following:

Given the input abc xyz foo I would like to add at least the following
token combinations to the index:

abc
abc xyz
abc xyz foo
abc foo
xyz
xyz foo
foo

A WhitespaceTokenizer combined with a ShingleFilter will take me there
to some extent, but won't e.g. add "abc foo" to the index.  Is there a
way to do this?

- Martin


Re: Indexing all permutations of words from the input

2011-01-20 Thread Jonathan Rochkind
Aha, I have no idea if there actually is a better way of achieving that; 
auto-completion with Solr is always tricky and I personally have not 
been happy with any of the designs I've seen suggested for it.  But I'm 
also not entirely sure your design will actually work -- though neither am I 
sure it won't!


I am thinking that for that auto-complete use you will actually need 
your field to be NOT tokenized, so you won't want to use the Whitespace 
tokenizer after all (I think!) -- unless maybe there's another filter 
you can put at the end of the chain that will take all the tokens and 
join them back together, separated by a single space, as a single 
token.  But I do think you'll need the whole multi-word string to be a 
single token in order to use terms.prefix how you want.


If you can't make ShingleFilter do it though, I don't think there are any 
built-in analyzers that will do the transformation you want. You could 
write your own in Java, perhaps based on ShingleFilter -- or it might be 
easier to have your own software make the transformations you want and 
then simply send the pre-transformed strings to Solr when indexing. Then 
you could simply send them to a 'string' type field that won't tokenize.
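
As a sketch of that last option (the field name is made up): send each pre-built
suggestion string to a plain string field,

  <field name="suggest" type="string" indexed="true" stored="false" multiValued="true"/>

and then hit the TermsComponent (the /terms handler in the example solrconfig)
with something like /terms?terms=true&terms.fl=suggest&terms.prefix=abc%20fo
to pull back the matching suggestions.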


On 1/20/2011 4:40 PM, Martin Jansen wrote:

On 20.01.11 22:19, Jonathan Rochkind wrote:

On 1/20/2011 4:03 PM, Martin Jansen wrote:

I'm looking for an <analyzer> configuration for Solr 1.4 that
accomplishes the following:

Given the input abc xyz foo I would like to add at least the following
token combinations to the index:

 abc
 abc xyz
 abc xyz foo
 abc foo
 xyz
 xyz foo
 foo


Why do you want to do this, what is it meant to accomplish?  There might be a 
better way to accomplish what it is you are trying to do; I can't think of 
anything (which doesn't mean it doesn't exist) that what you're actually trying 
to do would be required in order to do.  What sorts of queries do you intend to 
serve with this setup?

I'm in the process of setting up an index for term suggestion. In my use
case people should get the suggestion "abc foo" for the search query
"abc fo", under the assumption that "abc xyz foo" has been submitted
to the index.

My current plan is to use TermsComponent with the terms.prefix=
parameter for this, because it seems to be pretty efficient and I get
things like correct sorting for free.

I assume there is a better way for achieving this then?

- Martin


Re: Opensearch Format Support

2011-01-20 Thread Jonathan Rochkind
No, not exactly. In general, people don't expose their Solr API directly 
to the world -- they front Solr with some software that is exposed to 
the world. (If you do expose your Solr API directly to the world, you 
will need to think carefully about security, and make sure you aren't 
letting anyone in the world do things you don't want them to do to your 
Solr index, like commit new documents!).


It would not be all that hard to write software that searches Solr on 
the backend via an OpenSearch interface -- an OpenSearch interface is 
basically just results in Atom format, usually.  And then just an 
OpenSearch Description document that just specifies what your search URL 
is. You'd have to have things like 'title' or 'last updated' or whatever 
other fields you want in your Atom result in Solr stored fields, if you 
wanted to get them purely from Solr -- and you'd have to tell this 
hypothetical OpenSearch front end what Solr stored fields to use for 
what elements in the Atom response.  So it's not something where some 
software could just go on top of any Solr index at all and provide a 
valid Atom or RSS response (which is basically all OpenSearch is).


I do not know if anyone else has already written an open source 
configurable atom/opensearch front-end to Solr, you could try googling 
around. But it would not be a very difficult task for a programmer 
familiar with Solr and with OpenSearch/Atom/RSS.


Jonathan

On 1/20/2011 4:29 PM, Tod wrote:

Does Solr support the Opensearch format?  If so could someone point me
to the correct documentation?


Thanks - Tod



Re: Return all contents from collection

2011-01-19 Thread Jonathan Rochkind
I know that this is often a performance problem -- but Erick, I am 
interested in the 'better solution' you hint at!


There are a variety of cases where you want to 'dump' all documents from 
a collection. One example might be in order to build a Google SiteMap 
for your app that's fronting your Solr. That's mine at the moment.   If 
anyone can think of a way to do this that doesn't have horrible 
performance (and bonus points if it doesn't completely mess up caches 
too by filling them with everything), that would be awesome.
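
The least-bad approach I know of is to page through in modest chunks rather than
asking for everything at once, e.g.:

  q=*:*&fl=id&sort=id asc&start=0&rows=1000
  q=*:*&fl=id&sort=id asc&start=1000&rows=1000
  ...

only pulling the stored fields you actually need. Deep paging does get slower as
start grows, though, so I'd still love to hear a better answer.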


Jonathan

On 1/18/2011 8:47 PM, Erick Erickson wrote:

This is usually a bad idea, but if you really must use
q=*:*&start=0&rows=1000000

Assuming that there are fewer than 1,000,000 documents in your index.

And if there are more, you won't like the performance anyway.

Why do you want to do this? There might be a better solution.

Best
Erick

On Tue, Jan 18, 2011 at 7:58 PM, Dan Baughmanda...@hostworks.com  wrote:


Is there a way I can simply tell the index to return its entire record set?

I tried starting and ending with just  a * but no dice.



Re: Local param tag voodoo ?

2011-01-19 Thread Jonathan Rochkind
What query are you actually trying to do?  There's probably a way to do 
it, possibly using nested queries -- but not using illegal syntax like 
some of your examples!  If you explain what you want to do, someone may 
be able to tell you how.  From the hints in your last message, I suspect 
nested queries _might_ be helpful to you.


On 1/19/2011 3:46 AM, Xavier SCHEPLER wrote:

Ok I was already at this point.
My faceting system uses exactly what is described in this page. I read it in 
the Solr 1.4 book; otherwise I wouldn't ask.
The problem is that the filter queries don't affect the relevance score of 
the results, so I want the terms in the main query.




From: Markus Jelsmamarkus.jel...@openindex.io
Sent: Tue Jan 18 21:31:52 CET 2011
To:solr-user@lucene.apache.org
Subject: Re: Local param tag voodoo ?


Hi,

You get an error because LocalParams need to be in the beginning of a
parameter's value. So no parenthesis first. The second query should not give an
error because it's a valid query.

Anyway, i assume you're looking for :
http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams

Cheers,


Hey,

here are my needs :

- a query that has tagged and untagged contents
- facets that ignore the tagged contents

I tried:

q=({!tag=toExclude} ignored)  taken into account
q={tag=toExclude v='ignored'} take into account

Both resulted in an error.

Is this possible or do I have to try another way ?


--
All email sent from the Sciences Po mail system must comply with its
conditions of use. To consult them, visit
http://www.ressources-numeriques.sciences-po.fr/confidentialite_courriel.htm


Re: unix permission styles for access control

2011-01-19 Thread Jonathan Rochkind
No. There is no built-in way to address 'bits' in Solr that I am aware 
of.  Instead you can think about how to transform your data at indexing time 
into individual tokens (rather than bits) in one or more fields, such 
that they are capable of answering your query.  Solr works with tokens as 
the basic unit of operation (mostly, basically), not characters or bytes 
or bits.


On 1/19/2011 9:48 AM, Dennis Gearon wrote:

Sorry for repeat, trying to make sure this gets on the newsgroup to 'all'.

So 'fieldName.x' is how to address bits?


  Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Toke Eskildsen <t...@statsbiblioteket.dk>
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Sent: Wed, January 19, 2011 12:23:04 AM
Subject: Re: unix permission styles for access control

On Wed, 2011-01-19 at 08:15 +0100, Dennis Gearon wrote:

I was wondering if there are binary operation filters? Haven't seen any in the
book, nor was I able to find any using Google.

So if I had 0600(octal) in a permission field, and I wanted to return any
records that 'permission & 0400(octal) == TRUE', how would I filter that?

Don't you mean permission & 0400(octal) == 0400? Anyway, the
functionality can be accomplished by extending your index a bit.


You could split the permission into user, group and all parts, then use
an expanded query.

If the permission is 0755 it will be indexed as
user_p:7 group_p:5 all_p:5

If you're searching for something with at least 0650 your query should
be expanded to
(user_p:7 OR user_p:6) AND (group_p:7 OR group_p:5)


Alternatively you could represent the bits explicitly in the index:
user_p:1 user_p:2 user_p:4 group_p:1 group_p:4 all_p:1 all_p:4

Then a search for 0650 would query with
user_p:2 AND user_p:4 AND group_p:1 AND group_p:4


Finally you could represent all valid permission values, still split
into parts with
user_p:1 user_p:2 user_p:3 user_p:4 user_p:5 user_p:6 user_p:7
group_p:1 group_p:2 group_p:3 group_p:4 group_p:5
all_p:1 all_p:2 all_p:3 all_p:4 all_p:5

The query would be simply
user_p:6 AND group_p:5


Re: unix permission styles for access control

2011-01-19 Thread Jonathan Rochkind
Yep, that's what I'm suggesting as one possible approach to consider, 
whether it will work or not depends on your specifics.


Character length in a token doesn't really matter for solr performance.  
It might be less confusing to actually put "read update delete own" (or 
whatever 'o' stands for) in a field, such that it will be tokenized so 
each of those words is a separate token.  (Make sure you aren't stemming 
or using synonyms, heh!).


Or instead of separating a single string into tokens, use a multi-valued 
String field, and put "read", "delete", etc. in as separate values. That 
is actually more straightforward and less confusing than tokenizing.


Then you can just search for fq=permissions:read or whatever.
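A minimal sketch of that multi-valued approach (the field name and values here are only for illustration, not from the thread):

<field name="permissions" type="string" indexed="true" stored="false" multiValued="true"/>

Index each document with whatever rights apply (read, update, delete, ...) as separate values, and a filter like fq=permissions:read then limits results to documents the user may read.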

Again, whether this will actually work for you depends on exactly what 
your requirements are, but it's something to consider before 
resorting to weird patches.  It will work in any Solr version.


The first approach to solving a problem in Solr should be trying to 
think "Can I solve this by setting up my index in such a way that I can 
ask the questions I want simply by asking if a certain token is in a 
certain field?"  Because that's what Solr does, basically: tell you if 
certain tokens are in certain fields. If you can reduce the problem to 
that, Solr will handle it easily, simply, and efficiently.  Otherwise, 
you might need weird patches. :)


On 1/19/2011 12:45 PM, Dennis Gearon wrote:

So, if I used something like r-u-d-o in a field (read, update, delete, others) I
could get it tokenized to those four characters, and then search for those in
that field. Is that what you're suggesting? (Thanks, by the way.)

An article I read created a 'hybrid' access control system (can't remember if it
was ACL or RBAC). It used a primary system like the Unix file system's 9-bit permissions
for the primary permissions normally needed on most objects of any kind, and
then flagged if there were any other permissions and any other groups. It was
very fast for the primary permissions, and fast for the secondary.


  Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jonathan Rochkind <rochk...@jhu.edu>
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Sent: Wed, January 19, 2011 8:40:30 AM
Subject: Re: unix permission styles for access control

No. There is no built in way to address 'bits' in Solr that I am aware
of.  Instead you can think about how to transform your data at indexing
into individual tokens (rather than bits) in one or more field, such
that they are capable of answering your query.  Solr works in tokens as
the basic unit of operation (mostly, basically), not characters or bytes
or bits.

On 1/19/2011 9:48 AM, Dennis Gearon wrote:

Sorry for repeat, trying to make sure this gets on the newsgroup to 'all'.

So 'fieldName.x' is how to address bits?


   Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Toke Eskildsen <t...@statsbiblioteket.dk>
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Sent: Wed, January 19, 2011 12:23:04 AM
Subject: Re: unix permission styles for access control

On Wed, 2011-01-19 at 08:15 +0100, Dennis Gearon wrote:

I was wondering if there are binary operation filters? Haven't seen any in the
book, nor was I able to find any using Google.

So if I had 0600(octal) in a permission field, and I wanted to return any
records that 'permission & 0400(octal) == TRUE', how would I filter that?

Don't you mean permission & 0400(octal) == 0400? Anyway, the
functionality can be accomplished by extending your index a bit.


You could split the permission into user, group and all parts, then use
an expanded query.

If the permission is 0755 it will be indexed as
user_p:7 group_p:5 all_p:5

If you're searching for something with at least 0650 your query should
be expanded to
(user_p:7 OR user_p:6) AND (group_p:7 OR group_p:5)


Alternatively you could represent the bits explicitly in the index:
user_p:1 user_p:2 user_p:4 group_p:1 group_p:4 all_p:1 all_p:4

Then a search for 0650 would query with
user_p:2 AND user_p:4 AND group_p:1 AND group_p:4


Finally you could represent all valid permission values, still split
into parts with
user_p:1 user_p:2 user_p:3 user_p:4 user_p:5 user_p:6 user_p:7
group_p:1 group_p:2 group_p:3 group_p:4 group_p:5
all_p:1 all_p:2 all_p:3 all_p:4 all_p:5

The query would be simply
user_p:6 AND group_p:5


Re: facet or filter based on user's history

2011-01-19 Thread Jonathan Rochkind
The problem is going to be 'near real time' indexing issues.  Solr 1.4 
at least does not do a very good job of handling very frequent commits. 
If you want to add to the user's history in the Solr index every time 
they click the button, and they click the button a lot, this 
naturally leads to very frequent commits to Solr (every minute, 
every second, multiple times a second), and you're going to have RAM and 
performance problems.


I believe there are some things in trunk that make handling this better; 
I don't know the details, but "near real time search" is the phrase to 
google or ask about on this list.


Or, if it's acceptable for your requirements, you could record all the 
"I've read this" clicks in an external store, and only add them to the 
Solr index nightly, or even hourly.  If you batch em and add em as 
frequently as you can get away with (every hour sure, every 10 minutes 
pushing it, every minute, no), you can get around that issue. Or for 
that matter you could ADD em to Solr but only 'commit' every hour or 
whatever, but I don't like that strategy since if Solr crashes or 
otherwise restarts you pretty much lose those pending commits, better to 
queue em up in an external store.
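A rough sketch of that batching approach (URL, core name and payload are illustrative only): queue the clicks in your own store, then on a schedule post them as one batch and commit once, e.g.

curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<add><doc>...</doc><doc>...</doc></add>'
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'

That way cache warming happens once per batch instead of once per click.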


On 1/19/2011 1:52 PM, Markus Jelsma wrote:

Hi,

I've never seen Solr's behaviour with a huge amount of values in a multi-valued
field, but I think it should work alright. Then you can store a list of user
IDs along with each book document and use filter queries to include or
exclude the book from the result set.

Cheers,


Hi,

I'm looking for ideas on how to make an efficient facet query on a
user's history with respect to the catalog of documents (something
like Read document already: yes / no). The catalog is around 100k
titles and there are several thousand users. Of course, each user has
a different history, many having read fewer than 500 titles, but some
heavy users having read perhaps 50k titles.

Performance is not terribly important right now so all I did was bump
up the boolean query limit and put together a big string of document
id's that the user has read. The first query is slow but once it's in
the query cache it's fine. I would like to find a better way of doing
it though.

What type of solr plugin would be best suited to helping in this
situation? I could make a function plugin that provides something like
hasHadBefore() -> true/false, but would that be efficient for faceting
and filtering? Another idea is a QParserPlugin that looks for a field
like hasHadBefore:userid and somehow substitutes in the list of docs.
But I'm not sure how a new parser plugin would interact with the
existing parser. Can solr use a parser plugin to only handle one
field, and leave all the other fields to the default parser?

Thanks,
Jon


Re: performance during index switch

2011-01-19 Thread Jonathan Rochkind

During commit?

A commit (and especially an optimize) can be expensive in terms of both 
CPU and RAM as your index grows larger, leaving less CPU for querying, 
and possibly less RAM which can cause Java GC slowdowns in some cases.


A common suggestion is to use Solr replication to separate out a Solr 
index that you index to, and then replicate to a slave index that 
actually serves your queries. This should minimize any performance 
problems on your 'live' Solr while indexing, although there's still 
something that has to be done for the actual replication of course. 
Haven't tried it yet myself.  Plan to -- my plan is actually to put them 
both on the same server (I've only got one), but in separate JVMs, and 
on a server with enough CPU cores that hopefully the indexing won't 
steal CPU the querying needs.
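For reference, that split is wired up with the ReplicationHandler in each solrconfig.xml; a rough sketch (host name, poll interval and file list are placeholders):

On the master:
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On the slave:
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">01:00:00</str>
  </lst>
</requestHandler>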


On 1/19/2011 2:23 PM, Tri Nguyen wrote:

Hi,
  
Are there performance issues during the index switch?
  
As the size of index gets bigger, response time slows down?  Are there any studies on this?
  
Thanks,
  
Tri


Re: performance during index switch

2011-01-19 Thread Jonathan Rochkind

On 1/19/2011 2:56 PM, Tri Nguyen wrote:

Yes, during a commit.
  
I'm planning to do as you suggested, having a master do the indexing and replicating the index to a slave which leads to my next questions.
  
While the slave replicates the index files from the master, how does it impact performance on the slave?


That I am not certain, because I haven't done it yet myself, but I am 
optimistic it will be tolerable.


As with any commit, when the slave replicates it will temporarily make a 
second copy of any changed index files (possibly the whole index), and 
it will then set up new searchers on the new copy of the index, and it 
will warm that new index, and then once warmed, it'll switch live 
searches over to the new index, and delete any old copies of indexes.


So you may still need a bunch of 'extra' RAM in the JVM to accommodate 
that overlap period.  You will need some extra disk space. As for the actual 
CPU: it will take some CPU for the slave to run the new 
warmers, but it should be tolerable, not very noticeable... I'm hoping.


One main benefit of the replication setup is that you can _optimize_ on 
the master, which will be completely out of the way of the slave.


Even with the replication setup, you still can't commit (i.e. pull down 
changes from master) near real time in 1.4 though. You can't commit so 
often that a new index is not done warming when a new commit comes in, 
or your Solr will grind to a halt as it uses too much CPU and RAM. There 
are various ways people have suggested you can try to work around this, 
but I haven't been too happy with any of em; I think it's best just not 
to commit/pull down changes from master that often.  Unless you REALLY 
need to, and are prepared to get into the details of Solr to figure out how 
to make it work as well as it can.


Re: Search on two core and two schema

2011-01-18 Thread Jonathan Rochkind
Solr can't do that. Two cores are two separate cores, you have to do two 
separate queries, and get two separate result sets.


Solr is not an rdbms.

On 1/18/2011 12:24 PM, Damien Fontaine wrote:

I want execute this query :

Schema 1 :
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="UUID_location" type="string" indexed="true" stored="true" required="true" />

Schema 2 :
<field name="UUID_location" type="string" indexed="true" stored="true" required="true" />
<field name="label" type="string" indexed="true" stored="true" required="true" />
<field name="type" type="string" indexed="true" stored="true" required="true" />

Query :
select?facet=true&fl=title&q=title:*&facet.field=UUID_location&rows=10&qt=standard

Result :

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="facet">true</str>
<str name="fl">title</str>
<str name="q">title:*</str>
<str name="facet.field">UUID_location</str>
<str name="qt">standard</str>
</lst>
</lst>
<result name="response" numFound="1889" start="0">
<doc>
<str name="title">titre 1</str>
</doc>
<doc>
<str name="title">Titre 2</str>
</doc>
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="UUID_location">
<int name="Japan">998</int>
<int name="China">891</int>
</lst>
</lst>
<lst name="facet_dates"/>
  </lst>
</response>

On 18/01/2011 17:55, Stefan Matheis wrote:

Okay .. and .. now .. you're trying to do what? perhaps you could give us an
example, w/ real data .. sample queries & results.
because actually i cannot imagine what you want to achieve, sorry

On Tue, Jan 18, 2011 at 5:24 PM, Damien Fontaine <dfonta...@rosebud.fr> wrote:


On my first schema, there is information about a document like title,
lead, text, etc. and many UUIDs (each UUID is a taxon's ID).
My second schema contains my taxonomies with auto-complete and facets.

On 18/01/2011 17:06, Stefan Matheis wrote:

   Search on two cores but combine the results afterwards to present them in

one group, or what exactly are you trying to do Damien?

On Tue, Jan 18, 2011 at 5:04 PM, Damien Fontaine <dfonta...@rosebud.fr> wrote:

   Hi,

I would like to make a search on two cores with different schemas.

Sample :

Schema Core1
   - ID
   - Label
   - IDTaxon
...

Schema Core2
   - IDTaxon
   - Label
   - Hierarchy
...

Schemas are very different, I can't group them. Do you have an idea how to
realize this search?

Thanks,

Damien







Re: StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-13 Thread Jonathan Rochkind
It's a known 'issue' in dismax, (really an inherent part of dismax's 
design with no clear way to do anything about it), that qf over fields 
with different stop word definitions will produce odd results for a 
query with a stopword.


Here's my understanding of what's going on: 
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/


On 1/12/2011 6:48 PM, Markus Jelsma wrote:

Here's another thread on the subject:
http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-
td493483.html

And slightly off topic: you'd also might want to look at using common grams,
they are really useful for phrase queries that contain stopwords.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory



Here is what debug says each of these queries parse to:

1. q=life&defType=edismax&qf=Title  ... returns 277,635 results
2. q=the life&defType=edismax&qf=Title ... returns 277,635 results
3. q=life&defType=edismax&qf=Title Contributor  ... returns 277,635
4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

1. +DisjunctionMaxQuery((Title:life))
2. +((DisjunctionMaxQuery((Title:life)))~1)
3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life))
4. +((DisjunctionMaxQuery((Contributor:the))
DisjunctionMaxQuery((Contributor:life | Title:life)))~2)

I see what's going on here.  Because "the" is a stop word for Title, it
gets removed from the first part of the expression.  This means that
Contributor is required to contain "the".  dismax does the same thing
too.  I guess I should have run debug before asking the mail list!

It looks like the only workarounds I have are to either filter out the
stopwords in the client when this happens, or enable stop words for all
the fields that are used in qf with stopword-enabled fields.
Unless...someone has a better idea??

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, January 12, 2011 4:44 PM
To: solr-user@lucene.apache.org
Cc: Jayendra Patil
Subject: Re: StopFilterFactory and qf containing some fields that use it
and some that do not


Have used edismax and Stopword filters as well. But usually use the fq
parameter e.g. fq=title:"the life" and never had any issues.

That is because filter queries are not relevant for the mm parameter which
is being used for the main query.


Can you turn on the debugQuery and check whats the Query formed for all
the combinations you mentioned.

Regards,
Jayendra

On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James

<james.d...@ingrambook.com> wrote:

I'm running into a problem with StopFilterFactory in conjunction with
(e)dismax queries that have a mix of fields, only some of which use
StopFilterFactory.  It seems that if even 1 field on the qf parameter
does not use StopFilterFactory, then stop words are not removed when
searching any fields.  Here's an example of what I mean:

- I have 2 fields indexed:
Title is textStemmed, which includes StopFilterFactory (see
below). Contributor is textSimple, which does not include
StopFilterFactory

(see below).
- "The" is a stop word in stopwords.txt
- q=life&defType=edismax&qf=Title  ... returns 277,635 results
- q=the life&defType=edismax&qf=Title ... returns 277,635 results
- q=life&defType=edismax&qf=Title Contributor  ... returns 277,635
results - q=the life&defType=edismax&qf=Title Contributor ... returns 0
results

It seems as if the stop words are not being stripped from the query
because qf contains a field that doesn't use StopFilterFactory.  I
did testing with combining Stemmed fields with not Stemmed fields in
qf and it seems as if stemming gets applied regardless.  But stop
words do not.

Does anyone have ideas on what is going on?  Is this a feature or
possibly a bug?  Any known workarounds?  Any advice is appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

<fieldType name="textSimple" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

<fieldType name="textStemmed" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="0" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
stemEnglishPossessive="1" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>

Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Jonathan Rochkind
Scanning for only 'valid' utf-8 is definitely not simple.  You can 
eliminate some obviously not valid utf-8 things by byte ranges, but you 
can't confirm valid utf-8 alone by byte ranges. There are some bytes 
that can only come after or before other certain bytes to be valid utf-8.


There is no good way to do what you're doing, once you've lost track of 
what encoding something is in, you are reduced to applying heuristics to 
text strings to guess what encoding it is meant to be.


There is no cheap way to do this to an entire Solr index, you're just 
going to have to fetch every single stored field (indexed fields are 
pretty much lost to you) and apply heuristic algorithms to it.  Keep in 
mind that Solr really probably shouldn't ever be used as your canonical 
_store_ of data; Solr isn't a 'store', it's an index.  So you really 
ought to have this stuff stored somewhere else if you want to be able to 
examine it or modify it like this, and just deal with that somewhere 
else.  This isn't really a Solr question at all, really, even if you are 
querying Solr on stored fields to try and guess their char encodings.


There are various packages of such heuristic algorithms to guess char 
encoding, I wouldn't try to write my own. icu4j might include such an 
algorithm, not sure.


On 1/13/2011 1:12 PM, Peter Karich wrote:

  take a look also into icu4j which is one of the contrib projects ...


converting on the fly is not supported by Solr but should be relatively
easy in Java.
Also scanning is relatively simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html


We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple questions about this:

1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the documents that we index will
always be stored in utf-8? Can solr convert documents that need
converting on the fly, or can solr reject documents containing illegal
characters?

2) Is there a way to scan the existing index to find any string
containing non-utf8 characters? Or is there another way that I can
discover if any crept into my index?





RE: verifying that an index contains ONLY utf-8

2011-01-13 Thread Jonathan Rochkind
So you're allowed to put the entire original document in a stored field in 
Solr, but you aren't allowed to stick it in, say, a redis or couchdb too? Ah, 
bureaucracy. But no reason what you are doing won't work, as you of course 
already know from doing it.  

If you actually know the charset of a document when indexing it, you might want 
to consider putting THAT in a stored field; easier to keep track of the 
encoding you know than to try and guess it again later. 


From: Paul [p...@nines.org]
Sent: Thursday, January 13, 2011 6:21 PM
To: solr-user@lucene.apache.org
Subject: Re: verifying that an index contains ONLY utf-8

Thanks for all the responses.

CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the original of much of our data, so the solr index is the
only place it exists (to us). I do have a java app that reindexes,
i.e., reads all documents out of one index, does some transform on
them, then writes them to a second index. So I already have a place
where I see all the data in the index stream by. I wanted to make sure
there wasn't some built in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll see if the
string is a possible utf-8 string first and not change that. Then I
won't be introducing more errors and maybe I can detect a large
percentage of the non-utf-8 strings.

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir <rcm...@gmail.com> wrote:
 it does: 
 http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
 this takes a sample of the file and makes a guess.
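A rough Java sketch of both checks discussed in this thread (not from the original mails; it uses the icu4j CharsetDetector and the JDK CharsetDecoder as documented, everything else is illustrative):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class EncodingCheck {

    // Strict check: true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Heuristic guess, only worth consulting when isValidUtf8() is false.
    static String guessCharset(byte[] bytes) {
        CharsetMatch match = new CharsetDetector().setText(bytes).detect();
        return match == null ? null : match.getName();
    }
}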


RE: start value in queries zero or one based?

2011-01-13 Thread Jonathan Rochkind
You could have tried it and seen for yourself on any Solr server in your 
possession in less time than it took to have this thread. And if you don't have 
a Solr server, then why do you care?

But the answer is 0. 

http://wiki.apache.org/solr/CommonQueryParameters#start
The default value is 0

Since the default start is 0, leaving start out means you never skip the 
first item of your result set; so if you DO want to skip the first 
item of your result set, start=1 will do it. 
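For example (a generic illustration, not from the thread):

/select?q=*:*&start=0&rows=10   returns hits 1-10
/select?q=*:*&start=10&rows=10  returns hits 11-20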


From: Dennis Gearon [gear...@sbcglobal.net]
Sent: Thursday, January 13, 2011 6:04 PM
To: solr-user@lucene.apache.org
Subject: Re: start value in queries zero or one based?

I'm migrating to CTO/CEO status in life due to building a small company. I find
I don't have too much time for theory. I work with what is.

So, what is it, not what should it be.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Walter Underwood <wun...@wunderwood.org>
To: solr-user@lucene.apache.org
Sent: Thu, January 13, 2011 1:38:26 PM
Subject: Re: start value in queries zero or one based?

On Jan 13, 2011, at 1:28 PM, Dennis Gearon wrote:

 Do I even need a body for this message? ;-)

 Dennis Gearon

Are you asking "is it" or "should it be"? If the latter, we can also discuss
Emacs and vi.

wunder
--
Walter Underwood
K6WRU


Re: pruning search result with search score gradient

2011-01-12 Thread Jonathan Rochkind
The times I've _considered_ trying to do this (but generally decided it 
wasn't worth it) were when I didn't want those documents below the 
threshold to show up in the facet values.  In my application the facet 
counts are sometimes very pertinent information, and they are sometimes not 
quite as useful as they could be when they include barely-relevant hits.


On 1/12/2011 11:42 AM, Erick Erickson wrote:

What's the use-case you're trying to solve? Because if you're
still showing results to the user, you're taking information away
from them. Where are you expecting to get the list? If you try
to return the entire list, you're going to pay the penalty
of creating the entire list and transmitting it across the wire rather
than just a page's worth.

And if you're paging, the user will do this for you by deciding for
herself when she's getting less relevant results.

So I don't understand what value you're trying to provide to the end
user; perhaps if you elaborate on that I'll have a more useful
response

Best
Erick

On Tue, Jan 11, 2011 at 3:12 AM, Julien Piquot <julien.piq...@arisem.com> wrote:


Hi everyone,

I would like to be able to prune my search result by removing the less
relevant documents. I'm thinking about using the search score : I use the
search scores of the document set (I assume they are sorted by descending
order), normalise them (0 would be the lowest value and 1 the greatest
value) and then calculate the gradient of the normalised scores. The
documents with a gradient below a threshold value would be rejected.
If the scores are linearly decreasing, then no document is rejected.
However, if there is a brutal score drop, then the documents below the drop
are rejected.
The threshold value would still have to be tuned but I believe it would
make a much stronger metric than an absolute search score.

What do you think about this approach? Do you see any problem with it? Is
there any SOLR tools that could help me dealing with that?

Thanks for your answer.

Julien



Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind
I see a lot of people using shards to hold different types of 
documents, and it almost always seems to be a bad solution. Shards are 
intended for distributing a large index over multiple hosts -- that's 
it.  Not for some kind of federated search over multiple schemas, not 
for access control.


Why not put everything in the same index, without shards, and just use 
an 'fq' limit in order to limit to the specific documents you'd like to 
search over in a given search?  I think that would achieve your goal a 
lot more simply than shards -- then you use sharding only if and when 
your index grows to be so large you'd like to distribute it over 
multiple hosts, and when you do so you choose a shard key that will have 
more or less equal distribution across shards.
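For example (the field name is invented for illustration), if every document carries a kind field identifying which kind it is, a search restricted to one kind is just q=your+query&fq=kind:contract, and leaving the fq off searches everything.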


Using shards for access control or schema management just leads to 
headaches.


[Apparently Solr could use some highlighted documentation on what shards 
are really for, as it seems to be a very common issue on this list, 
someone trying to use them for something else and then inevitably 
finding problems with that approach.]


Jonathan

On 1/7/2011 6:48 AM, supersoft wrote:

The reason of this distribution is the kind of the documents. In spite of
having the same schema structure (and solr conf), a document belongs to 1 of
5 different kinds.

Each kind corresponds to a concrete shard and due to this, the implemented
client tool avoids searching in all the shards when the user selects just
one or a few kinds. The tool runs a multi-sharded query of the proper
shards. I guess this is the right approach, but correct me if I am wrong.

The real problem of this architecture is the correlation between concurrent
users and response time:
1 query: n seconds
2 queries: 2*n second each query
3 queries: 3*n seconds each query
and so...

This is being a real headache because a single query has an acceptable
response time, but when many users are accessing the server the
performance drops sharply.


Re: Tuning StatsComponent

2011-01-10 Thread Jonathan Rochkind
I found StatsComponent to be slow only when I didn't have enough RAM 
allocated to the JVM.  I'm not sure exactly what was causing it, but it 
was pathologically slow -- and then adding more RAM to the JVM made it 
incredibly fast.


On 1/10/2011 4:58 AM, Gora Mohanty wrote:

On Mon, Jan 10, 2011 at 2:28 PM, stockiist...@shopgate.com  wrote:

Hello.

i`m using the StatsComponent to get the sum of amounts. but solr
statscomponent is very slow on a huge index of 30 Million documents. how can
i tune the statscomponent ?

Not sure about this problem.


the problem is that i have 5 currencies and i need to send a new request for
each currency. that makes the solr search sometimes very slow. =(

[...]

I guess that you mean the search from the front-end is slow.

It is difficult to make a guess without details of your index,
and of your queries, but one thing that immediately jumps
out is that you could shard the Solr index by currency, and
have your front-end direct queries for each currency to the
appropriate Solr server.

Please do share a description of what all you are indexing,
how large your index is, and what kind of queries you are
running. I take it that you have already taken a look at
http://wiki.apache.org/solr/SolrPerformanceFactors

Regards,
Gora



Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind

On 1/10/2011 5:03 PM, Dennis Gearon wrote:

What I seem to see suggested here is to use different cores for the things you
suggested:
   different types of documents
   Access Control Lists

I wonder how sharding would work in that scenario?


Sharding has nothing to do with that scenario at all. Different cores 
are essentially _entirely separate_.  While it can be convenient to use 
different cores like this, it means you don't get ANY searches that 
'join' over multiple 'kinds' of data in different cores.


Solr is not great at handling heterogeneous data like that.  Putting it 
in separate cores is one solution, although then they are entirely 
separate.  If that works, great.  Another solution is putting them in 
the same index, but using mostly different fields, and perhaps having a 
'type' field shared amongst all of your 'kinds' of data, and then always 
querying with an 'fq' for the right 'kind'.  Or if the fields they use 
are entirely different, you don't even need the fq, since a query on a 
certain field will only match a certain 'kind' of document.


Solr is not great at handling complex queries over data with 
heterogeneous schemata. Solr wants you to flatten all your data into 
one single set of documents.


Sharding is a way of splitting up a single index (multiple cores are 
_multiple indexes_) amongst several hosts for performance reasons, 
mostly when you have a very large index.  That is it.  The end.  if you 
have multiple cores, that's the same as having multiple solr indexes 
(which may or may not happen to be on the same machine). Any one or more 
of those cores could be sharded if you want. This is a separate issue.






Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind
And I don't think I've seen anyone suggest a separate core just for 
Access Control Lists. I'm not sure what that would get you.


Perhaps a separate store that isn't Solr at all, in some cases.

On 1/10/2011 5:36 PM, Jonathan Rochkind wrote:

Access Control Lists


RE: (FQ) Filter Query Caching Differences with OR and AND?

2011-01-06 Thread Jonathan Rochkind
Disclaimer:  I am not actually familiar with the solr code, all of the below is 
extrapolation from being pretty familiar with Solr's behavior. 

Yeah, it would be nice, but it would be a lot harder to code for solr.  Right 
now, the thing putting and retrieving entries into/from the filter cache 
doesn't really need to parse the query at all.  It just takes the whole query 
and uses it (effectively) for a cache key.  Keep in mind that Solr has 
pluggable query parsers, and the fq can (quite usefully!) be used with any 
query parser, not just lucene.  lucene, dismax, field, raw, a few others out of 
the box, others not officially part of solr but that users might write and use 
with their solr. Query parsers can be in use (and work with filter cache) that 
didn't even exist when the filter caching logic was written.  This is actually 
a very useful feature -- if there's behavior that's possible with lucene but 
not supported in a convenient way (or at all) by Solr API, you can write a 
query parser to do it yourself if you need to -- and your query parser will 
plug right in, and all other Solr features (such as filter caching!) will still 
work fine with it. 

So to get the filter caching to somehow go inside the query and cache and 
retrieve parts of it --  it would probably really need to be something each 
query parser were responsible for --  storing and retrieving elements from the 
filter cache as part of its ordinary query parsing behavior -- but only when 
it was inside an fq, not a q, which I'm not sure the query parser even knows 
right now.  Right now I think the query parser doesn't even have to know about 
the filter cache -- if an fq is retrieved from cache, then it doesn't even make 
it to the query parser. 

So yeah, it would be useful if separate components of an fq query could be 
cached separately -- but it would also be a lot more complicated.  But I'm sure 
nobody would mind seeing a patch if you want to figure it out. :)


From: Em [mailformailingli...@yahoo.de]
Sent: Thursday, January 06, 2011 2:36 AM
To: solr-user@lucene.apache.org
Subject: Re: (FQ) Filter Query Caching Differences with OR and AND?

Thank you Jonathan.

fq=foo:bar&fq=foo:baz seems to be the better alternative for "fq=foo:bar
AND foo:baz" if foo:bar and foo:baz are often used in different
combinations (not always together).

However, in most of the use cases I can think of, an "fq=foo:bar OR
foo:baz"-behaviour is expected and it would be nice if this fq would benefit
from a cached fq=foo:bar.

I can imagine why this is not the case, if only one of two fq-clauses were
cached.
However, when foo:bar and foo:baz were cached separately, why not
benefit from them when an "fq=foo:bar OR foo:baz" or "fq=foo:bar AND
foo:baz" is requested?

Who is responsible for putting fq's in the filterCache? I think one has to
modify the logic of that class to benefit from already cached but recombined
filterCaches.
This would have a little bit less performance than caching the entire
foo:bar AND foo:baz BitVector, since you need to reproduce one for that
special use-case, but I think the usage of the cache is far more efficient,
if foo:bar and foo:baz occur very frequently but foo:bar AND foo:baz
do not.

What do you think?

Regards


Jonathan Rochkind wrote:

 Each 'fq' clause is its own cache key.

 1. fq=foo:bar OR foo:baz
  = one entry in filter cache

 2. fq=foo:bar&fq=foo:baz
 = two entries in filter cache, will not use cached entry from #1

 3. fq=foo:bar
   = One entry, will use cached entry from #2

 4. fq=foo:bar
= One entry, will use cached entry from #2.

 So if you do queries in succession using each of those four fq's in
 order, you will wind up with 3 entries in the cache.

 Note that fq=foo:bar OR foo:baz is not semantically identical to
 fq=foo:bar&fq=foo:baz.  Rather the latter is semantically identical to
 fq=foo:bar AND foo:baz.   But fq=foo:bar&fq=foo:baz will be two cache
 entries, and fq=foo:bar AND foo:baz will be one cache entry, and the
 two won't share any cache entries.


 On 1/5/2011 3:17 PM, Em wrote:
 Hi,

 while reading through some information on the list and in the wiki, i
 found
 out that something is missing:

 When I specify a filter queries like this

 fq=foo:bar OR foo:baz
 or
 fq=foo:barfq=foo:baz
 or
 fq=foo:bar
 or
 fq=foo:baz

 How many filter query entries will be cached?
 Two, since there are two filters (foo:bar, foo:baz) or 3, since there are
 three different combinations (foo:bar OR foo:baz, foo:bar, foo:baz)?

 Thank you!




Re: searching against unstemmed text

2011-01-04 Thread Jonathan Rochkind
Do you have to do anything special to search against a field in Solr?  
No, that's what Solr does.


Please be more specific about what you are trying to do, what you expect 
to happen, and what happens instead.


If your Solr field is analyzed to stem, then indeed you can only match 
stemmed tokens, because that's the only tokens that are there.  You can 
create a different solr field that is not stemmed for wildcard searches 
if you like, which is perhaps what you're trying to do, but you haven't 
really told us.
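One common way to do that (field and type names here are invented): keep the stemmed field for ordinary search and add an unstemmed copy just for wildcards:

<field name="title_wild" type="text_unstemmed" indexed="true" stored="false"/>
<copyField source="title" dest="title_wild"/>

where text_unstemmed is a text type whose analyzer tokenizes and lowercases but has no stemming filter; wildcard queries then go against it, e.g. q=title_wild:hort* (remember that wildcard terms are not analyzed, so lowercase the query term yourself if the field is lowercased).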


On 1/4/2011 10:00 AM, Wodek Siebor wrote:

I'm trying to search using text_rev field, which is by default enabled in
the schema.xml,
but it doesn't work at all. Do I have to do anything special here.

I want to search using wildcards and searching against text field works
fine, except I can only find results against stemmed text.

Thanks,
Wlodek


Re: Sub query using SOLR?

2011-01-04 Thread Jonathan Rochkind

Yeah, I don't believe there's any good way to do it in Solr 1.4.

You can make two queries, first make your 'sub' query, get back the list 
of values, then construct the second query where you do {!field 
v=field_name} val1 OR val2 OR val3   OR valN


Kind of a pain, and there is a maximum number of conditions you can have 
in there (1024 maybe?).


It is OFT requested behavior, and the feature on SOLR-2272 is very 
exciting to me and I think would meet a lot of needs, but I haven't 
tried it yet myself.


Jonathan

On 1/4/2011 2:03 PM, Steven A Rowe wrote:

Hi Barani,

I haven't tried it myself, but the limited JOIN functionality provided by 
SOLR-2272 sounds very similar to what you want to do:

https://issues.apache.org/jira/browse/SOLR-2272

Steve


-Original Message-
From: bbarani [mailto:bbar...@gmail.com]
Sent: Tuesday, January 04, 2011 1:27 PM
To: solr-user@lucene.apache.org
Subject: Sub query using SOLR?


Hi,

I am trying to use subquery in SOLR, is there a way to implement this
using
SOLR query syntax?

Something like

Related_id: IN query(field=ud, q=”type:IT AND manager_12:dave”)

The thing I really want is to use output of one query to be the input of
another query.

Not sure if it is possible to use the query() function (function query)
for
my case..

Just want to know if ther is a better approach...

Thanks,
Barani
--
View this message in context: http://lucene.472066.n3.nabble.com/Sub-
query-using-SOLR-tp2193251p2193251.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Advice on Exact Matching?

2011-01-04 Thread Jonathan Rochkind
There is a hacky kind of thing that Bill Dueber figured out for using 
multiple fields and dismax to BOOST exact matches, but include all 
matches in the result set.


You have to duplicate your data in a second non-tokenized field. Then 
you use dismax pf to super boost matches on the non-tokenized field. 
Because 'pf' is a phrase search, you don't run into trouble with dismax 
pre-tokenization on white space, even though it's a field that might 
have internal-token whitespace. (Using a non-tokenized field with dismax 
qf will basically never match a result with whitespace, unless it's 
phrase-quoted in query. But pf works.).


Because it was a non-tokenized field, it only matches (and triggers the 
dismax pf super boost) if it's an exact match. And it works. You CAN 
normalize your 'exact match' field in field analysis, removing 
punctuation or normalizing whitespace or whatever, and that works too, 
doing it both at index and query time analysis.
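A rough sketch of that setup (field and type names invented; assume text_exactish is a type built on KeywordTokenizerFactory plus LowerCaseFilterFactory, so the whole value stays one token):

<field name="title_exact" type="text_exactish" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

defType=dismax&qf=title&pf=title_exact^100&q=electric+guitar

Queries that equal the whole normalized title get the large pf boost; everything else still matches through qf.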




On 1/4/2011 4:28 PM, Chris Hostetter wrote:

: I am trying to make sure that when I search for text—regardless of
: what that text is—that I get an exact match.  I'm *still* getting some
: issues, and this last mile is becoming very painful.  The solr field,
: for which I'm setting this up on, is pasted below my explanation.  I
: appreciate any help.

if you are using a TextField with some analysis components, it's
virtually impossible to get exact matches -- where my definition of
exact is that the query text is character for character identical to the
entire field value indexed.

is your definition of exact match different?  i assume it must be since you
are using TextField and talk about wanting to deal with whitespace between
words.  so i think you need to explain a little bit better what your
indexed data looks like, and what sample queries you expect to match that
data (and equally important: what queries should *not* match thta data,
and what data should *not* match those queries)

: If I want to find *all* Solr documents that match
: [id]somejunk\hi[/id] then life is instantly hell.

90% of the time when people have problems with exact matches it's
because of QueryParser meta characters -- characters like ":", "[" and
whitespace that the QueryParser uses as instructions.  you can use the
raw QParser to have every character treated as a literal

defType=raw
q=[id]somejunk\hi[/id]

-Hoss



Re: DIH and UTF-8

2010-12-29 Thread Jonathan Rochkind
I haven't tried it yet, but I _think_ in Rails if you are using the 
'mysql2' adapter (now standard with Rails3) instead of 'mysql', it might 
handle utf-8 better with less areas for gotchas.  I think if the 
underlying mysql database is set to use utf-8, then, at least with 
mysql2 adapter, you shouldn't need to set 'encoding' attribute on the 
database connection definition. But I could be wrong, and this isn't 
really about solr anymore of course.
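For reference, the relevant corner of a Rails database.yml looks something like this (adapter, database and environment names are placeholders):

production:
  adapter: mysql2
  database: myapp_production
  encoding: utf8

With the older mysql adapter, the encoding: utf8 line is the piece Mark describes adding below.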


On 12/29/2010 9:48 AM, Mark wrote:

Sure thing.

In my database.yml I was missing the encoding: utf8 option.

If one were to add unicode characters within rails (console, web form,
etc) the characters would appear to be saved correctly... ie when trying
to retrieve them back, everything looked perfect. The characters also
appeared correctly using the mysql prompt. However when trying to index
or retrieve those characters using JDBC/Solr the characters were mangled.

After adding the above utf8 encoding option I was able to correctly save
utf8 characters into the database and retrieve them using JDBC/Solr.
However when using the mysql client all the characters would show up as
all mangled or as ''. This was resolved by running the following
query "set names utf8;".

On 12/28/10 10:17 PM, Glen Newton wrote:

Hi Mark,

Could you offer a more technical explanation of the Rails problem, so
that if others encounter a similar problem your efforts in finding the
issue will be available to them?  :-)

Thanks,
Glen

PS. This has wandered somewhat off-topic to this list: apologies &
thanks for the patience of this list...

On Tue, Dec 28, 2010 at 4:15 PM, Mark <static.void@gmail.com> wrote:

It was due to the way I was writing to the DB using our rails application.
Everything looked correct but when retrieving it using the JDBC driver it was
all mangled.

On 12/27/10 4:38 PM, Glen Newton wrote:

Is it possible your browser is not set up to properly display the
chinese characters? (I am assuming you are looking at things through
your browser)
Do you have any problems viewing other chinese documents properly in
your browser?
Using mysql, can you see these characters properly?

What happens when you use curl or wget to get a document from solr and
looking at it using something besides your browser?

Yes, I am running out of ideas!  :-)

-Glen

On Mon, Dec 27, 2010 at 7:22 PM, Mark <static.void@gmail.com> wrote:

Just like the user of that thread... i have my database, table, columns and
system variables all set but it still doesn't work as expected.

Server version: 5.0.67 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | utf8                                   |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | utf8                                   |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.00 sec)


Any other ideas? Thanks


On 12/27/10 3:23 PM, Glen Newton wrote:

[client]

   default-character-set = utf8
   [mysql]
   default-character-set=utf8
   [mysqld]
   character_set_server = utf8
   character_set_client = utf8




RE: Solr 1.4.1 stats component count not matching facet count for multi valued field

2010-12-23 Thread Jonathan Rochkind
Interesting, the wiki page on StatsComponent says multi-valued fields "may be 
slow, and may use lots of memory". http://wiki.apache.org/solr/StatsComponent

Apparently it should also warn that multi-valued fields may not work at all? 
I'm going to add that with a link to the JIRA ticket. 

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Thursday, December 23, 2010 7:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 1.4.1 stats component count not matching facet count for 
multi valued field

: I have a facet field called option which may be multi-valued and
: a weight field which is single-valued.
:
: When I use the Solr 1.4.1 stats component with a facet field, i.e.
...
: I get conflicting results for the stats count result

a jira search for "solr stats multivalued" would have given you...

https://issues.apache.org/jira/browse/SOLR-1782

-Hoss


RE: Solr 1.4.1 stats component count not matching facet count for multi valued field

2010-12-23 Thread Jonathan Rochkind
Aha! Thanks, sorry, I'll clarify on my wiki edit. 

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Friday, December 24, 2010 12:11 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 1.4.1 stats component count not matching facet count for 
multi valued field

: Interesting, the wiki page on StatsComponent says multi-valued fields
: may be slow , and may use lots of memory.
: http://wiki.apache.org/solr/StatsComponent

*stats* over multivalued fields work, but use lots of memory -- that bug
only hits you when you compute stats over any field that is faceted by a
multivalued field.


-Hoss


RE: solr equiv of : SELECT count(distinct(field)) FROM index WHERE length(field) 0 AND other_criteria

2010-12-22 Thread Jonathan Rochkind
This won't actually give you the number of distinct facet values, but will give 
you the number of documents matching your conditions. It's more equivalent to 
SQL without the distinct. 

There is no way in Solr 1.4 to get the number of distinct facet values. 

I am not sure about the new features in trunk.  

From: Peter Karich [peat...@yahoo.de]
Sent: Wednesday, December 22, 2010 6:10 AM
To: solr-user@lucene.apache.org
Subject: Re: solr equiv of : SELECT count(distinct(field)) FROM index WHERE 
length(field) > 0 AND other_criteria

 facet=true&facet.field=field // SELECT count(distinct(field))
fq=field:[* TO *]  // WHERE length(field) > 0
q=other_criteriaA&fq=other_criteriaB // AND other_criteria

advantage: you can look into several fields at one time when adding
another facet.field
disadvantage: you get the counts split by the values of that field

fix this via field collapsing / results grouping
http://wiki.apache.org/solr/FieldCollapsing
or use deduplication: http://wiki.apache.org/solr/Deduplication

Regards,
Peter.

 Hi,

 Is there a way with faceting or field collapsing to do the SQL equivalent of

 SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND
 other_criteria

 i.e. I'm only interested in the total count not the individual records
 and counts.

 Cheers,
 Dan


--
http://jetwick.com open twitter search



Re: Duplicate values in multiValued field

2010-12-22 Thread Jonathan Rochkind
In my experience, that should work fine. Facetting in 1.4 works fine on 
multi-valued fields, and a duplicate value in the multi-valued field 
shouldn't be a problem.


On 12/22/2010 2:31 AM, Andy wrote:

If I put duplicate values into a multiValued field, would that cause any issues?

For example I have a multiValued field Color. Some of my documents have 
duplicate values for that field, such as: Green, Red, Blue, Green, Green.

Would the above (having 3 duplicate Green) be the same as having the duplicated 
values of: Green, Red, Blue?

Or do I need to clean my data and remove duplicate values before indexing?

Thanks.






Re: White space in facet values

2010-12-22 Thread Jonathan Rochkind
Another technique, which works great for facet fq's and avoids the need 
to worry about escaping, is using the field query parser instead:


fq={!field f=Product}Electric Guitar

Using the field query parser avoids the need for ANY escaping of your 
value at all, which is convenient in the facetting case -- you still 
need to URI-escape (ampersands for instance), but you shouldn't need to 
escape any Solr special characters like parens or double quotes or 
anything else, if you've made your string suitable for including in a 
URI. With the field query parser, a lot less to worry about.


http://lucene.apache.org/solr/api/org/apache/solr/search/FieldQParserPlugin.html

On 12/22/2010 9:53 AM, Dyer, James wrote:

The phrase solution works as does escaping the space with a backslash:  
fq=Product:Electric\ Guitar ... actually a lot of characters need to be escaped 
like this (ampersands and parentheses come to mind)...

I assume you already have this indexed as string, not text...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Andy [mailto:angelf...@yahoo.com]
Sent: Wednesday, December 22, 2010 1:11 AM
To: solr-user@lucene.apache.org
Subject: White space in facet values

How do I handle facet values that contain whitespace? Say I have a field Product that I want to facet on. A 
value for Product could be Electric Guitar. How should I handle the white space in 
Electric Guitar during indexing? What about when I apply the constraint fq=Product:Electric Guitar?






Re: White space in facet values

2010-12-22 Thread Jonathan Rochkind
Huh, does !term in 4.0 mean the same thing as !field in 1.4?  What you 
describe as !term in 4.0 dev is what I understand as !field in 1.4 doing.


On 12/22/2010 10:01 AM, Yonik Seeley wrote:

On Wed, Dec 22, 2010 at 9:53 AM, Dyer, James <james.d...@ingrambook.com> wrote:

The phrase solution works as does escaping the space with a backslash:  
fq=Product:Electric\ Guitar ... actually a lot of characters need to be escaped 
like this (ampersands and parentheses come to mind)...

One way to avoid escaping is to use the raw or term query parsers:

fq={!raw f=Product}Electric Guitar

In 4.0-dev, use {!term} since that will work with field types that
need to transform the external representation into the internal one
(like numeric fields need to do).

http://wiki.apache.org/solr/SolrQuerySyntax

-Yonik
http://www.lucidimagination.com





I assume you already have this indexed as string, not text...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Andy [mailto:angelf...@yahoo.com]
Sent: Wednesday, December 22, 2010 1:11 AM
To: solr-user@lucene.apache.org
Subject: White space in facet values

How do I handle facet values that contain whitespace? Say I have a field Product that I want to facet on. A 
value for Product could be Electric Guitar. How should I handle the white space in 
Electric Guitar during indexing? What about when I apply the constraint fq=Product:Electric Guitar?






Re: Solr query to get results based on the word length (letter count)

2010-12-22 Thread Jonathan Rochkind
No good way. At indexing time, I'd just store the number of chars in the 
title in a field of its own.  You can possibly do that solely in 
schema.xml with clever use of analyzers and copyField.


Solr isn't an rdbms.  Best to de-normalize at index time so what you're 
going to want to query is in the index.
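A sketch of the simplest version of that (the field name is invented, and it assumes an integer field type in your schema that supports range queries): have your indexing code compute the length and send it along, e.g.

<field name="title_length" type="int" indexed="true" stored="true"/>

and then the under-50-characters query is just fq=title_length:[* TO 49].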


On 12/22/2010 10:36 AM, Giri wrote:

Hi,

I have a Solr index that has thousands of records, the title is one of the
Solr fields, and I would like to query for title values that are less than
50 characters long. Is there a way to construct the Solr query to provide
results based on the character length?


thank you very much!



Re: solr equiv of : SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND other_criteria

2010-12-22 Thread Jonathan Rochkind
Well, that's true -- you can get the total number of facet values if you 
ALSO are willing to get back every facet value in the response.


If you've got a hundred thousand or so unique facet values, and what you 
really want is just the _count_ without ALSO getting back a very large 
response (and waiting for Solr to construct the very large response), 
then you're out of luck.


But if you're willing to get back all the values in the response too, 
that'll work, true.


On 12/22/2010 11:23 AM, Erik Hatcher wrote:

On Dec 22, 2010, at 09:21 , Jonathan Rochkind wrote:


This won't actually give you the number of distinct facet values, but will give you the 
number of documents matching your conditions. It's more equivalent to SQL without the 
distinct.

There is no way in Solr 1.4 to get the number of distinct facet values.

That's not true - the total number of facet values is the distinct number of 
values in that field.   You need to be sure you have facet.limit=-1 (default is 
100) to see all values in the response rather than just a page of them though.

Erik




Re: full text search in multiple fields

2010-12-22 Thread Jonathan Rochkind

Did you reindex after you changed your analyzers?

On 12/22/2010 12:57 PM, PeterKerk wrote:

Hi guys,

There's one more thing to get this code to work as I need I just found
out...

I'm now using: q=title_search:hort*&defType=lucene
as iorixxx suggested.

it works well BUT, this query doesn't find results if the title in the DB is
Hortus supremus

I tried adding some tokenizers and filters to solve this, what I think is a
casing issue, but no luck...

below is my code...what am I missing here?

Thanks again!


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


<field name="title" type="text_ws" indexed="true" stored="true"/>
<field name="title_search" type="text" indexed="true" stored="true"/>
<copyField source="title" dest="title_search"/>


Re: Case Insensitive sorting while preserving case during faceted search

2010-12-21 Thread Jonathan Rochkind
Hoss, I think the use case being asked about is specifically facet.sort,
though: cases where you actually do want to sort the facet values (not the
records) with facet.sort, while still presenting the facet values with their
original case but sorting them case-insensitively.


The solutions offered at those URLs don't address this.

I'm pretty sure there isn't really any good solution for this; Solr just
won't do it, and that's just how it goes.


On 12/21/2010 2:33 PM, Chris Hostetter wrote:

: I am trying to do a facet search and sort the facet values too.
...
: Then I followed the sample example schema.xml, created a copyField of type
...
:   <fieldType name="alphaOnlySort" class="solr.TextField"
: sortMissingLast="true" omitNorms="true">
...
: But the sorted facet values don't have their case preserved anymore.
:
: How can I get around this?

Did you look at how/why/when alphaOnlySort is used in the example?

The FAQ entry you referred to addresses almost the exact same scenario of
wanting to search/sort on the same data...

http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F

...the simplest thing to do is to use copyField to index a second version
of your field using the StrField class.


So have one version of your field using StrField that you facet on, and
copyField that to another version (using TextField and
KeywordTokenizer) that you sort on.
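
A minimal schema sketch of that arrangement (the field names are illustrative; string and alphaOnlySort are the types from the example schema):

<field name="category" type="string" indexed="true" stored="true"/>
<field name="category_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="category" dest="category_sort"/>

Facet on category, and sort documents with sort=category_sort asc.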



-Hoss



RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-14 Thread Jonathan Rochkind
But the entirety of the old indexes (no longer on disk) wasn't cached in
memory, right?  Or is it?  Maybe this is me not understanding Lucene enough. I
thought that portions of the index were cached in memory, but that sometimes the
index reader still has to go to disk to get things that aren't currently in
caches.  If this is true (tell me if it's not!), we have an index reader that
was based on indexes that... are no longer on disk. But the index reader is
still open. What happens when it has to go to disk for info?

And the second replication will trigger a commit even if there are in fact no
new files to be transferred over to the slave, because there have been no changes
since the prior sync with the failed commit?

From: Upayavira [...@odoko.co.uk]
Sent: Tuesday, December 14, 2010 2:23 AM
To: solr-user@lucene.apache.org
Subject: RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap 
getting collected?

The second commit will bring in all changes, from both syncs.

Think of the sync part as a glorified rsync of files on disk. So the
files will have been copied to disk, but the in memory index on the
slave will not have noticed that those files have changed. The commit is
intended to remedy that - it causes a new index reader to be created,
based upon the new on disk files, which will include updates from both
syncs.

Upayavira

On Mon, 13 Dec 2010 23:11 -0500, Jonathan Rochkind rochk...@jhu.edu
wrote:
 Sorry, I guess I don't understand the details of replication enough.

 So slave tries to replicate. It pulls down the new index files. It tries
 to do a commit but fails.  But the next commit that does succeed will
 have all the updates. Since it's a slave, it doesn't get any commits of
 its own. But then some amount of time later, it does another replication
 pull. There are at this time maybe no _new_ changes since the last failed
 replication pull. Does this trigger a commit that will get those previous
 changes actually added to the slave?

 In the meantime, between commits.. are those potentially large pulled new
 index files sitting around somewhere but not replacing the old slave
 index files, doubling disk space for those files?

 Thanks for any clarification.

 Jonathan
 
 From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley
 [yo...@lucidimagination.com]
 Sent: Monday, December 13, 2010 10:41 PM
 To: solr-user@lucene.apache.org
 Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't
 WeakHashMap getting collected?

 On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind rochk...@jhu.edu
 wrote:
  Yonik, how will maxWarmingSearchers in this scenario affect replication?
  If a slave is pulling down new indexes so quickly that the warming 
  searchers would ordinarily pile up, but maxWarmingSearchers is set to 1 
  what happens?

 Like any other commit, this will limit the number of searchers
 warming in the background to 1.  If a commit is called, and that tries
 to open a new searcher while another is already warming, it will fail.
  The next commit that does succeed will have all the updates though.

 Today, this maxWarmingSearchers check is done after the writer has
 closed and before a new searcher is opened... so calling commit too
 often won't affect searching, but it will currently affect indexing
 speed (since the IndexWriter is constantly being closed/flushed).

 -Yonik
 http://www.lucidimagination.com
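
For reference, a sketch of the solrconfig.xml setting being discussed (the value is illustrative):

<!-- cap the number of searchers that may be warming in the background;
     commits that would exceed it fail to open a new searcher -->
<maxWarmingSearchers>1</maxWarmingSearchers>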



Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-14 Thread Jonathan Rochkind

Yeah, I understand basically how caches work.

What I don't understand is what happens in replication if the new
segment files are successfully copied, but the actual commit fails due to
maxWarmingSearchers.  The new files are on disk... but the commit
could not succeed and there is NOT a new index reader, because the
commit failed.   And there is potentially a long gap before a future
successful commit.


1. Will the existing index searcher have problems because the files have 
been changed out from under it?


2. Will a future replication -- at which NO new files are available on 
master -- still trigger a future commit on slave?


Maybe these are obvious to everyone but me, because I keep asking this 
question, and the answer I keep getting is just describing the basics of 
replication, as if this obviously answers my question.


Or maybe the answer isn't obvious or clear to anyone including me, in 
which case the only way to get an answer is to try and test it myself.  
A bit complicated to test, at least for my level of knowledge, as I'm 
not sure exactly what I'd be looking for to answer either of those 
questions.


Jonathan

On 12/14/2010 9:53 AM, Upayavira wrote:

A Lucene index is made up of segments. Each commit writes a segment.
Sometimes, upon commit, some segments are merged together into one, to
reduce the overall segment count, as too many segments hinder
performance. Upon optimisation, all segments are (typically) merged into
a single segment.

Replication copies any new segments from the master to the slave,
whether they be new segments arriving from a commit, or new segments
that are a result of a segment merge. The result is a set of index files
on disk that are a clean mirror of the master.

Then, when your replication process has finished syncing changed
segments, it fires a commit on the slave. This causes Solr to create a
new index reader.
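
For reference, a sketch of the slave-side Java replication configuration in solrconfig.xml that does this polling and post-sync commit (masterUrl and pollInterval are illustrative):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/replication</str>
    <!-- ask the master for new segments every 60 seconds -->
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>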

When the first query comes in, it triggers Solr to populate its caches.
Whoever is unfortunate enough to cause that cache population will see much
slower responses (we've seen 40s rather than 1s).

The solution to this is to set up an autowarming query in
solrconfig.xml. This query is executed against the new index reader,
causing caches to populate from the updated files on disk. Only once
that autowarming query has completed will the index reader be made
available to Solr for answering search queries.

There's some cleverness, whose details I can't remember, that specifies how
much to keep from the existing caches and how much to build up from the
files on disk. If I recall correctly, it is all configured in solrconfig.xml.
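
A sketch of such a warming setup in solrconfig.xml (the query and cache sizes are illustrative): the listener runs queries against the new index reader before it is put into service, and the autowarmCount attribute on each cache controls how much of the old cache is carried over:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="facet">true</str><str name="facet.field">category</str></lst>
  </arr>
</listener>

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>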

You ask a good question whether a commit will be triggered if the sync
brought over no new files (i.e. if the previous one did, but this one
didn't). I'd imagine that Solr would compare the maximum segment ID on
disk with the one in memory to make such a decision, in which case Solr
would spot the changes from the previous sync and still work. The best
way to be sure is to try it!

The simplest way to try it (as I would do it) would be to (a rough command
sketch follows the list):

1) switch off post-commit replication
2) post some content to solr
3) commit on the master
4) use rsync to copy the indexes from the master to the slave
5) do another (empty) commit on the master
6) trigger replication via an HTTP request to the slave
7) See if your posted content is available on your slave.
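
A rough command sketch of steps 2-6 (hosts, paths, and the example document are illustrative):

# 2-3) post a test document and commit on the master
curl 'http://master:8983/solr/update?commit=true' -H 'Content-Type: text/xml' \
     --data-binary '<add><doc><field name="id">replication-test-1</field></doc></add>'
# 4) copy the index files from master to slave
rsync -av master:/path/to/solr/data/index/ /path/to/solr/data/index/
# 5) an empty commit on the master
curl 'http://master:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'
# 6) trigger replication on the slave
curl 'http://slave:8983/solr/replication?command=fetchindex'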

Maybe someone else here can tell you what is actually going on and save
you the effort!

Does that help you get some understanding of what is going on?

Upayavira

On Tue, 14 Dec 2010 09:15 -0500, Jonathan Rochkind rochk...@jhu.edu
wrote:

But the entirety of the old indexes (no longer on disk) wasn't cached in
memory, right?  Or is it?  Maybe this is me not understanding Lucene
enough. I thought that portions of the index were cached in memory, but
that sometimes the index reader still has to go to disk to get things
that aren't currently in caches.  If this is true (tell me if it's not!),
we have an index reader that was based on indexes that... are no longer
on disk. But the index reader is still open. What happens when it has to
go to disk for info?

And the second replication will trigger a commit even if there are in
fact no new files to be transferred over to the slave, because there have been
no changes since the prior sync with the failed commit?

From: Upayavira [...@odoko.co.uk]
Sent: Tuesday, December 14, 2010 2:23 AM
To: solr-user@lucene.apache.org
Subject: RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't
WeakHashMap getting collected?

The second commit will bring in all changes, from both syncs.

Think of the sync part as a glorified rsync of files on disk. So the
files will have been copied to disk, but the in memory index on the
slave will not have noticed that those files have changed. The commit is
intended to remedy that - it causes a new index reader to be created,
based upon the new on disk files, which will include updates from both
syncs.
