Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

Thanks Erick!

Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then a 
query for "mixedCase" will no longer also match "mixed Case".


I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for "mixedCase" 
to match both/either "mixed Case" or "mixedCase" in the index (with 
case insensitivity on top of that via another filter).


That would support names like "duBois", which are sometimes 
spelled "du bois" and sometimes "dubois", and allow the query "duBois" 
to match both in the index.


I had somehow thought that was what WDF was intended for. But it's 
actually not the usual functioning, and may not be realistic?


I'm a bit confused about what splitOnCaseChange combined with 
catenateWords is meant to do at all.  It _is_ generating both the split 
and single-word tokens at query time -- but not in a way that actually 
allows it to match both the split and single-word tokens?  What is 
supposed to be the purpose/use case for splitOnCaseChange with 
catenateWords? If any?


Jonathan

On 12/29/14 7:20 PM, Erick Erickson wrote:

Jonathan:

Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what's
significant in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

On 12/29/14 5:24 PM, Jack Krupansky wrote:


WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction



I do not understand what separate query/index analysis you are suggesting to
accomplish what I wanted.

I understand the WDF, like all software, is not magic, of course. But I
thought this was an intended use case of the WDF, with those settings:

A "mixedCase" query would match "mixedCase" in the index; and the same query 
"mixedCase" would also match the two separate words "mixed Case" in the index 
(case-insensitively, since I apply an ICUFoldingFilter on top of that).

Was I wrong, is this not an intended thing for the WDF to do? Or do I just
have the wrong configuration options for it to do it? Or is it a bug?

When I started this thread a few months ago, I think Erick Erickson agreed
this was an intended use case for the WDF, but maybe I explained it poorly.
Erick if you're around and want to at least confirm whether WDF is supposed
to do this in your understanding, that would be great!

Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind
I guess I don't understand what the four use cases are, or the three out 
of four use cases, or whatever -- what the intended uses of the WDF are.


Can you explain what the intended use of this setting is:

generateWordParts=1 catenateWords=1 splitOnCaseChange=1

Is that supposed to do something useful (at either query or index time), 
or is that a nonsensical configuration that nobody should ever use?
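
(For concreteness, my understanding -- which may be wrong -- is that at 
analysis time that setting turns a single token like mixedCase into:

    mixed      (position 1)
    Case       (position 2)
    mixedCase  (position 2, the catenated word, stacked on the last part)

i.e. the split parts and the catenated word all come out of the filter 
together.)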


I understand how analysis can be different at index vs query time. I 
think what I don't fully understand is what the possibilities and 
intended use case of the WDF are, with various configurations.


I thought one of the intended use cases, with appropriate configuration, 
was to do what I'm talking about: allow a "mixedCase" query to match both 
"mixed Case" and "mixedCase" in the index. I think you're saying I'm wrong, 
and this is not something WDF can do? Can you confirm I understand you 
right?


Thanks!

Jonathan

On 12/30/14 11:30 AM, Jack Krupansky wrote:

Right, that's what I meant by WDF not being magic - you can configure it
to match any three out of four use cases as you choose, but there is no
choice that matches all of the use cases.

To be clear, this is not a bug in WDF, but simply a limitation.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind rochk...@jhu.edu
wrote:


Thanks Erick!

Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
query for mixedCase will no longer also match mixed Case.

I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for mixedCase to
match both/either mixed Case or mixedCase in the index. (with case
insensitivity on top of that via another filter).

That would support things like names like duBois which are sometimes
spelled du bois and sometimes dubois, and allow the query duBois to
match both in the index.

I had somehow thought that was what WDF was intended for. But it's
actually not the usual functioning, and may not be realistic?

I'm a bit confused about what splitOnCaseChange combined with
catenateWords is meant to do at all.  It _is_ generating both the split and
single-word tokens at query time -- but not in a way that actually allows
it to match both the split and single-word tokens?  What is supposed to be
the purpose/use case for splitOnCaseChange with catenateWords? If any?

Jonathan


On 12/29/14 7:20 PM, Erick Erickson wrote:


Jonathan:

Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what's
significant in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu
wrote:


On 12/29/14 5:24 PM, Jack Krupansky wrote:



WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to
separate
the index and query analyzers and they need to respect that distinction




I do not understand what separate query/index analysis you are
suggesting to
accomplish what I wanted.

I understand the WDF, like all software, is not magic, of course. But I
thought this was an intended use case of the WDF, with those settings:

A mixedCase query would match mixedCase in the index; and the same
query
mixedCase would also match two separate words mixed Case in index.
(Case insensitively since I apply an ICUFoldingFilter on top of that).

Was I wrong, is this not an intended thing for the WDF to do? Or do I
just
have the wrong configuration options for it to do it? Or is it a bug?

When I started this thread a few months ago, I think Erick Erickson
agreed
this was an intended use case for the WDF, but maybe I explained it
poorly.
Erick if you're around and want to at least confirm whether WDF is
supposed
to do this in your understanding, that would be great!

Jonathan







Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

On 12/30/14 11:45 AM, Alexandre Rafalovitch wrote:

On 30 December 2014 at 11:12, Jonathan Rochkind rochk...@jhu.edu wrote:

I'm a bit confused about what splitOnCaseChange combined with catenateWords
is meant to do at all.  It _is_ generating both the split and single-word
tokens at query time


Have you tried only having WDF during indexing with both options set?
And same chain but without WDF at all during query?


Without WDF at all in the query, then "mixedCase" in a query would match 
"mixedCase" in the index, but would no longer match "mixed Case" in the index.


I thought I was using WDF in such a way that "mixedCase" in a query could 
match both/either "mixedCase" and/or "mixed Case" in the index. And I 
thought this was an intended use case of the WDF.


But perhaps I was wrong, and the WDF simply can't do this?  Is WDF 
intended mainly for use at index time and not query time? In general, 
I'm confused about the various things WDF can and can't do, and the 
various configurations to make it do that.


Thanks for everyone's advice.


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind
Okay, thanks. I'm not sure if it's my lack of understanding, but I feel 
like I'm having a very hard time getting straight answers out of you 
all, here.


I want the query "mixedCase" to match both/either "mixed Case" and 
"mixedCase" in the index.


What configuration of WDF at index/query time would do this?

This isn't necessarily the only thing I want WDF to do, but it's 
something I want it to do and thought it was doing and found out it 
wasn't. So we can isolate/simplify to there -- if I can figure out what 
WDF configuration (if any?) can do that first, then I can always move on 
to figuring out how/if that impacts the other things I want WDF to do.


So is there a WDF configuration that can do that? Or is the problem that 
it's confusing, and none of you are sure either whether there is one or 
what it would be?


Jonathan

On 12/30/14 12:02 PM, Jack Krupansky wrote:

I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You're not wrong about anything here... you just need to accept that WDF
is not magic and can't handle every use case that anybody can imagine.

And you do need to be careful about interactions between the query parser
and the analyzers, especially in these kinds of cases where a single term
might generate multiple terms.

Some of these features really are only suitable for advanced, expert
users.

Note that one of the features that Solr is missing is support for the
Google-like feature of splitting concatenated words (regardless of case.)
That's worthy of a Jira.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:44 AM, Jonathan Rochkind rochk...@jhu.edu
wrote:


I guess I don't understand what the four use cases are, or the three out
of four use cases, or whatever. What the intended uses of the WDF are.

Can you explain what the intended use of setting:

generateWordParts=1 catenateWords=1 splitOnCaseChange=1

Is that supposed to do something useful (at either query or index time),
or is that a nonsensical configuration that nobody should ever use?

I understand how analysis can be different at index vs query time. I think
what I don't fully understand is what the possibilities and intended use
case of the WDF are, with various configurations.

I thought one of the intended use cases, with appropriate configuration,
was to do what I'm talking about: allow a "mixedCase" query to match both
"mixed Case" and "mixedCase" in the index. I think you're saying I'm wrong, and
this is not something WDF can do? Can you confirm I understand you right?

Thanks!

Jonathan


On 12/30/14 11:30 AM, Jack Krupansky wrote:


Right, that's what I meant by WDF not being magic - you can configure it
to match any three out of four use cases as you choose, but there is no
choice that matches all of the use cases.

To be clear, this is not a bug in WDF, but simply a limitation.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind rochk...@jhu.edu
wrote:

  Thanks Erick!


Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
query for mixedCase will no longer also match mixed Case.

I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for mixedCase
to
match both/either mixed Case or mixedCase in the index. (with case
insensitivity on top of that via another filter).

That would support things like names like duBois which are sometimes
spelled du bois and sometimes dubois, and allow the query duBois to
match both in the index.

I had somehow thought that was what WDF was intended for. But it's
actually not the usual functioning, and may not be realistic?

I'm a bit confused about what splitOnCaseChange combined with
catenateWords is meant to do at all.  It _is_ generating both the split
and
single-word tokens at query time -- but not in a way that actually allows
it to match both the split and single-word tokens?  What is supposed to
be
the purpose/use case for splitOnCaseChange with catenateWords? If any?

Jonathan


On 12/29/14 7:20 PM, Erick Erickson wrote:

  Jonathan:


Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what's
significant in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu
wrote:

  On 12/29/14 5:24 PM, Jack Krupansky wrote:




WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to
separate
the index and query analyzers and they need to respect that
distinction




I do not understand what

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

On 12/30/14 12:35 PM, Walter Underwood wrote:

You want preserveOriginal="1".

You should only do this processing at index time.


If I only do this processing at index time, then "mixedCase" at query 
time will no longer match "mixed Case" in the index/source material.


I think I'm having trouble explaining. Let's say the source material 
being indexed included "mixed Case", not "mixedCase".  I want 
"mixedCase" in a query to still match it.


But if the source material that went into the index contained 
"mixedCase", I still want "mixedCase" in a query to match it as well.
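
To illustrate what I mean, here's my (possibly wrong) understanding of the 
terms that would end up in the index with index-time-only WDF (catenateWords 
and/or preserveOriginal on), no WDF at query time, and folding afterwards:

    indexed "mixedCase"   =>  terms: mixed, case, mixedcase   -- query term mixedcase matches
    indexed "mixed Case"  =>  terms: mixed, case              -- query term mixedcase does NOT match

So index-time-only processing covers the first case, but not the second one, 
which I care about just as much.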




Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jonathan Rochkind
Okay, some months later I've come back to this with an isolated 
reproduction case. Thanks very much for any advice or debugging help you 
can give.


The WordDelimiter filter is making a mixed-case query NOT match the 
single-case source, when it ought to.


I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no 
sense to debug here, and I need to install and try to reproduce on a 
more recent version).


I have an index that includes ONE document (deleted and reindexed after 
the index change), with content in only one field (text) other than 'id', 
and that content is one word: "delalain".


My analysis (both index and query, I don't have different ones) for the 
'text' field is simply:


<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>

    <filter class="solr.ICUFoldingFilterFactory" />
  </analyzer>
</fieldType>

I am querying simply with e.g. /select?defType=lucene&q=text%3Adelalain

Querying for "delalain" finds this document, as expected. Querying for 
"DELALAIN" finds this document, as expected (note the ICUFoldingFilterFactory).


However, querying for "deLALAIN" does not find this document, which is 
unexpected.


INDEX analysis of the source, "delalain", ends up as this in the index, 
which seems pretty straightforward, so I'll only bother pasting in the 
final index analysis:


##
text       delalain
raw_bytes  [64 65 6c 61 6c 61 69 6e]
position   1
start      0
end        8
type       ALPHANUM
script     Latin
###




QUERY analysis of the problematic query, "deLALAIN", looks like this:

#
ICUT   text       deLALAIN
       raw_bytes  [64 65 4c 41 4c 41 49 4e]
       start      0
       end        8
       type       ALPHANUM
       script     Latin
       position   1


WDF    text       de        LALAIN                deLALAIN
       raw_bytes  [64 65]   [4c 41 4c 41 49 4e]   [64 65 4c 41 4c 41 49 4e]
       start      0         2                     0
       end        2         8                     8
       type       ALPHANUM  ALPHANUM              ALPHANUM
       position   1         2                     2
       script     Common    Common                Common


ICUFF  text       de        lalain                delalain
       raw_bytes  [64 65]   [6c 61 6c 61 69 6e]   [64 65 6c 61 6c 61 69 6e]
       position   1         2                     2
       start      0         2                     0
       end        2         8                     8
       type       ALPHANUM  ALPHANUM              ALPHANUM
       script     Common    Common                Common
###



It's obviously the WordDelimiterFilter that is messing things up -- but 
how/why, and is it a bug?


It wants to search for both "de lalain" as a phrase, as well as 
alternately "delalain" as one word -- that's the intended supported 
point of the WDF with this configuration, right? And it should work?


The problem is that it is not successfully matching "delalain" as one word 
-- so, how do I figure out why not, and what to do about it?


Previously, Erick and Diego asked for the info from debug=query, so 
here is that as well:



<lst name="debug">
  <str name="rawquerystring">text:deLALAIN</str>
  <str name="querystring">text:deLALAIN</str>
  <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
  <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
  <str name="QParser">LuceneQParser</str>
</lst>


Hmm, that does not seem quite right. If I interpret it correctly, it's 
looking for "de" followed by either "lalain" or "delalain" -- i.e., it 
would match "de delalain"? But that's not right at all.
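
If I'm reading the MultiPhraseQuery semantics right (and I may not be), it 
wants the term "de" at some position p, followed by either "lalain" or 
"delalain" at position p+1 -- presumably because WDF stacks the catenated 
token on the position of the last sub-token. My indexed document only has 
the single token "delalain" at one position, with no "de" token in front of 
it, so that phrase query can never match it:

    index:  delalain(1)
    query:  de(1)  then  (lalain | delalain)(2)

Which would at least explain the behavior I'm seeing, even if it isn't the 
behavior I want.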


So, what's gone wrong? Something with WDF configured with 
generateWordParts/catenateWords/splitOnCaseChange? Is it a bug? (And if it's 
a bug, one that might be fixed in a more recent Solr?)


Thanks!

Jonathan




On 9/3/14 7:15 PM, Erick Erickson wrote:

Jonathan:

If at all possible, delete your collection/data directory (the whole
directory, including data) between runs after you've changed
your schema (at least any of your analysis that pertains to indexing).
Mixing old and new schema definitions can add to the confusion!

Good luck!
Erick

On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

Thanks Erick and Diego. Yes, I noticed in my last message I'm not actually
using defaults, not sure why I chose non-defaults originally.

I still need to find time to make a smaller isolation/reproduction case, I'm
getting confusing results that suggest some other part of my field def may
be pertinent.

I'll come back when I've done that (hopefully next week), and include the
_parsed_ from debug=query then. Thanks!

Jonathan



On 9/2/14 4:26 PM, Erick Erickson wrote:


What happens if you append

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jonathan Rochkind

On 12/29/14 5:24 PM, Jack Krupansky wrote:

WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction


I do not understand what separate query/index analysis you are 
suggesting to accomplish what I wanted.


I understand the WDF, like all software, is not magic, of course. But I 
thought this was an intended use case of the WDF, with those settings:


A mixedCase query would match mixedCase in the index; and the same 
query mixedCase would also match two separate words mixed Case in 
index.  (Case insensitively since I apply an ICUFoldingFilter on top of 
that).


Was I wrong, is this not an intended thing for the WDF to do? Or do I 
just have the wrong configuration options for it to do it? Or is it a bug?


When I started this thread a few months ago, I think Erick Erickson 
agreed this was an intended use case for the WDF, but maybe I explained 
it poorly. Erick if you're around and want to at least confirm whether 
WDF is supposed to do this in your understanding, that would be great!


Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-03 Thread Jonathan Rochkind
Thanks Erick and Diego. Yes, I noticed in my last message I'm not 
actually using defaults, not sure why I chose non-defaults originally.


I still need to find time to make a smaller isolation/reproduction case, 
I'm getting confusing results that suggest some other part of my field 
def may be pertinent.


I'll come back when I've done that (hopefully next week), and include 
the _parsed_ from debug=query then. Thanks!


Jonathan


On 9/2/14 4:26 PM, Erick Erickson wrote:

What happens if you append debug=query to your query? IOW, what does the
_parsed_ query look like?

Also note that the defaults for WDFF are _not_ identical. catenateWords and
catenateNumbers are 1 in the
index portion and 0 in the query section. Still, this shouldn't be a
problem all other things being equal.

Best,
Erick


On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind rochk...@jhu.edu wrote:


On 9/2/14 1:51 PM, Erick Erickson wrote:


bq: In my actual index, query MacBook is matching ONLY mac book, and
not macbook

I suspect your query parameters for WordDelimiterFilterFactory doesn't
have
catenate words set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?



Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both index
and query phases (is that right?):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>

It's hard to cut and paste the results of the analysis page into email (or
anywhere!), I'll give you screenshots, sorry -- and I'll give them for our
whole real world app complex field definition. I'll also paste in our
entire field definition below. But I realize my next step is probably
creating a simpler isolation/reproduction case (unless you have a magic
answer from this!).

Again, the problem is that "MacBook" seems to be only matching on indexed
"macbook" and not indexed "mac book".


MacBook query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

MacBook index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

mac book index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation,
         so our synonym filter involving C++ etc. can still work.
         From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
         the rbbi file is in our local ./conf, copied from the lucene source tree -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

    <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt"
            ignoreCase="true"/>

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

    <!-- folding needs to be after WordDelimiter, so WordDelimiter
         can do its thing with full cases and such -->
    <filter class="solr.ICUFoldingFilterFactory" />

    <!-- ICUFolding already includes lowercasing, no
         need for a separate lowercasing step
    <filter class="solr.LowerCaseFilterFactory"/>
    -->

    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>









WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind
Hello, I'm running into a case where a query is not returning the 
results I expect, and I'm hoping someone can offer some explanation that 
might help me fine tune things or understand what's up.


I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later a filter that 
downcases everything for case-insensitive searching. It includes many 
other things too, but I think these are the pertinent facts.


For query dELALAIN, the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with d and 
ELALAIN split into two tokens, and another with just one token.


Later, all the tokens are lowercased by another filter in the chain. 
(actually an ICU filter which is doing something more complicated than 
just lowercasing, but I think we can consider it lowercasing for the 
purposes of this discussion).


If I understand right what the WordDelimiterFilter is trying to do here, 
it's probably doing something special because of the lowercase d 
followed by an uppercase letter, a special case for that. (I don't get 
this behavior with other mixed case queries not beginning with 'd').


And, what I think it's trying to do, is match text indexed as "d 
elalain" as well as text indexed as "delalain".


The problem is, it's not accomplishing that -- it is NOT matching text 
that was indexed as delalain (one token).


I don't entirely understand what the position attribute is for -- but 
I wonder if in this case, the position on dELALAIN is really supposed 
to be 1, not 2?  Could that be responsible for the bug?  Or is position 
irrelevant in this case?


If that's not it, then I'm at a loss as to what may be causing this bug 
-- or even if it's a bug at all, or I'm just not understanding intended 
behavior. I expect a query for dELALAIN to match text indexed as 
delalain (because of the forced lowercasing in the filter chain). But 
it's not doing so. Are my expectations wrong? Bug? Something else?


Thanks for any advice,

Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind

Thanks for the response.

I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as 
they are long and complicated. I can narrow it down to an isolation case 
if I need to. My indexed field in question is relatively short strings.


But what it has to do with is the WordDelimiterFilter's default 
splitOnCaseChange=1 and generateWordParts=1 settings, and their effects.


Let's take a less confusing example: the query "MacBook", with a 
WordDelimiterFilter followed by something that downcases everything.


I think what the WDF (followed by case folding) is trying to do is make 
the query "MacBook" match both indexed text "mac book" as well as "macbook" 
-- either one should be a match. Is my understanding right of what 
WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is 
intending to do?


In my actual index, the query "MacBook" is matching ONLY "mac book", and not 
"macbook", which is unexpected. I indeed want it to match both. (I 
realize I could make it match only 'macbook' by setting 
splitOnCaseChange=0 and/or generateWordParts=0.)


It's possible this is happening as a side effect of other parts of my 
complex field definition, and I really do need to post the whole thing 
and/or isolate it. But I wonder if there are known general problem cases 
that cause this kind of failure, or any known bugs in 
WordDelimiterFilter (in Solr 4.3?) that cause this kind of failure.


And I wonder if the WordDelimiter filter spitting out the token "MacBook" 
with position 2 rather than 1 is expected, irrelevant, or possibly a 
relevant problem.


Thanks again,

Jonathan

On 9/2/14 12:59 PM, Michael Della Bitta wrote:

Hi Jonathan,

Little confused by this line:


And, what I think it's trying to do, is match text indexed as d elalain

as well as text indexed by delalain.

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote:


Hello, I'm running into a case where a query is not returning the results
I expect, and I'm hoping someone can offer some explanation that might help
me fine tune things or understand what's up.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later a filter that
downcases everything for case-insensitive searching. It includes many other
things too, but I think these are the pertinent facts.

For query dELALAIN, the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with d and
ELALAIN split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain.
(actually an ICU filter which is doing something more complicated than just
lowercasing, but I think we can consider it lowercasing for the purposes of
this discussion).

If I understand right what the WordDelimiterFilter is trying to do here,
it's probably doing something special because of the lowercase d followed
by an uppercase letter, a special case for that. (I don't get this behavior
with other mixed case queries not beginning with 'd').

And, what I think it's trying to do, is match text indexed as d elalain
as well as text indexed by delalain.

The problem is, it's not accomplishing that -- it is NOT matching text
that was indexed as delalain (one token).

I don't entirely understand what the position attribute is for -- but I
wonder if in this case, the position on dELALAIN is really supposed to be
1, not 2?  Could that be responsible for the bug?  Or is position
irrelevant in this case?

If that's not it, then I'm at a loss as to what may be causing this bug --
or even if it's a bug at all, or I'm just not understanding intended
behavior. I expect a query for dELALAIN to match text indexed as
delalain (because of the forced lowercasing in the filter chain). But
it's not doing so. Are my expectations wrong? Bug? Something else?

Thanks for any advice,

Jonathan





Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind
Yes, thanks, I realize I can twiddle those parameters, but it will 
probably result in "MacBook" no longer matching "mac book" at all, but 
ONLY matching "macbook".


My understanding of the default settings of WordDelimiterFilterFactory is that 
they are intended for "MacBook" to match both "mac book" AND "macbook".


I will try to create an isolation reproduction that demonstrates this 
ruling out interference from other filters (or identifying the other 
filters), to make my question more clear, I guess.


Jonathan

On 9/2/14 1:34 PM, Michael Della Bitta wrote:

If that's your problem, I bet all you have to do is twiddle on one of the
catenate options, either catenateWords or catenateAll.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote:


Thanks for the response.

I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as
they are long and complicated. I can narrow it down to an isolation case if
I need to. My indexed field in question is relatively short strings.

But what it's got to do with is the WordDelimiterFilter's default
splitOnCaseChange=1 and generateWordParts=1, and the effects of such.

Let's take a less confusing example, query MacBook. With a
WordDelimiterFilter followed by something that downcases everything.

I think what the WDF (followed by case folding) is trying to do is make
query MacBook match both indexed text mac book as well as macbook --
either one should be a match. Is my understanding right of what
WordDelimiterfilter with splitOnCaseChange=1 and generateWordParts=1 is
intending to do?

In my actual index, query MacBook is matching ONLY mac book, and not
macbook.  Which is unexpected. I indeed want it to match both. (I realize
I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or
generateWordParts=0).

It's possible this is happening as a side effect of other parts of my
complex field definition, and I really do need to post hte whole thing
and/or isolate it. But I wonder if there are known general problem cases
that cause this kind of failure, or any known bugs in WordDelimiterFilter
(in Solr 4.3?) that cause this kind of failure.

And I wonder if WordDelimiter filter spitting out the token MacBook with
position 2 rather than 1 is expected, irrelevant, or possibly a
relevant problem.

Thanks again,

Jonathan


On 9/2/14 12:59 PM, Michael Della Bitta wrote:


Hi Jonathan,

Little confused by this line:

  And, what I think it's trying to do, is match text indexed as d elalain



as well as text indexed by delalain.

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/
112002776285509593336/posts
w: appinions.com http://www.appinions.com/



On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind rochk...@jhu.edu
wrote:

  Hello, I'm running into a case where a query is not returning the results

I expect, and I'm hoping someone can offer some explanation that might
help
me fine tune things or understand what's up.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later a filter that
downcases everything for case-insensitive searching. It includes many
other
things too, but I think these are the pertinent facts.

For query dELALAIN, the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with d and
ELALAIN split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain.
(actually an ICU filter which is doing something more complicated than
just
lowercasing, but I think we can consider it lowercasing for the purposes
of
this discussion).

If I understand right what the WordDelimiterFilter is trying to do here,
it's probably doing something special because of the lowercase d
followed
by an uppercase letter, a special case for that. (I don't get this
behavior
with other mixed case queries not beginning with 'd').

And, what I think it's

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind

On 9/2/14 1:51 PM, Erick Erickson wrote:

bq: In my actual index, query MacBook is matching ONLY mac book, and
not macbook

I suspect your query parameters for WordDelimiterFilterFactory doesn't have
catenate words set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?


Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both 
index and query phases (is that right?):


<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>


It's hard to cut and paste the results of the analysis page into email 
(or anywhere!), I'll give you screenshots, sorry -- and I'll give them 
for our whole real world app complex field definition. I'll also paste 
in our entire field definition below. But I realize my next step is 
probably creating a simpler isolation/reproduction case (unless you have 
a magic answer from this!).


Again, the problem is that "MacBook" seems to be only matching on 
indexed "macbook" and not indexed "mac book".



MacBook query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

MacBook index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

mac book index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation,
         so our synonym filter involving C++ etc. can still work.
         From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
         the rbbi file is in our local ./conf, copied from the lucene source tree -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

    <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt"
            ignoreCase="true"/>

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

    <!-- folding needs to be after WordDelimiter, so WordDelimiter
         can do its thing with full cases and such -->
    <filter class="solr.ICUFoldingFilterFactory" />

    <!-- ICUFolding already includes lowercasing, no
         need for a separate lowercasing step
    <filter class="solr.LowerCaseFilterFactory"/>
    -->

    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>






Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Jonathan Rochkind

On 12/17/13 1:16 PM, Chris Hostetter wrote:

As i mentioned in the blog above, as long as you have a uniqueKey field
that supports range queries, bulk exporting of all documents is fairly
trivial by sorting on your uniqueKey field and using an fq that also
filters on your uniqueKey field modify the fq each time to change the
lower bound to match the highest ID you got on the previous page.


Aha, very nice suggestion, I hadn't thought of this, when myself trying 
to figure out decent ways to 'fetch all documents matching a query' for 
some bulk offline processing.
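
For my own notes, the pattern as I understand it would look something like 
this (field names and page size are just examples):

    # first page
    /select?q=*:*&fq=my_query&sort=id+asc&rows=1000&fl=id

    # each later page: exclusive lower bound on the highest id from the previous page
    /select?q=*:*&fq=my_query&fq=id:{LAST_ID TO *]&sort=id+asc&rows=1000&fl=id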


One question that I was never sure about when trying to do things like 
this -- is this going to end up blowing the query and/or document caches 
if used on a live Solr?  By filling up those caches with the results of 
the 'bulk' export?  If so, is there any way to avoid that? Or does it 
probably not really matter?


Jonathan


Re: json update moves doc to end

2013-12-03 Thread Jonathan Rochkind

What order, the order if you supply no explicit sort at all?

Solr does not make any guarantees about what order documents will come 
back in if you do not ask for a sort.


In general in Solr/lucene, the only way to update a document is to 
re-add it as a new document, so that's probably what's going on behind 
the scenes, and it probably affects the 'default' sort order -- which 
Solr makes no guarantees about anyway; you probably shouldn't even count 
on it being consistent at all.


If you want a consistent sort order, maybe add a field with a timestamp, 
and ask for results sorted by the timestamp field? And then make sure 
not to change the timestamp when you do an update that you don't want to 
change the order?
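
For example (just a sketch; the field name is made up and assumes a date 
fieldType is defined in your schema):

    <field name="created_at" type="date" indexed="true" stored="true"/>

Populate it once when the document is first created, send the same value 
again on every update, and sort with sort=created_at asc.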


Apologies if I've misunderstood the situation.

On 12/3/13 1:00 PM, Andreas Owen wrote:

When I search for "agenda" I get a lot of hits. Now if I update the 2nd
result by json-update, the doc is moved to the end of the index when I search
for it again. The field I change is "editorschoice" and it never contains
the search term "agenda", so I don't see why it changes the order. Why does
it?



Part of Solrconfig requesthandler I use:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200
        title^20 h_*^14
        tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
        contentmanager^5 links^5
        last_modified^5 url^5
    </str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">log(clicks)^8</str> <!-- tested -->
    <!-- todo: number of links (count urlparse in links query) /
         frequency of the search term (bf = count in title and text) -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>

    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>

    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp}inhaltstyp</str>
    <str name="f.inhaltstyp.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung}veranstaltung</str>
    <str name="f.veranstaltung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>




Re: Need idea to standardize keywords - ring tone vs ringtone

2013-10-28 Thread Jonathan Rochkind
Do you know about the Solr synonym feature?  That seems more applicable 
to what you're describing then stopwords. I'd stay away from stopwords 
entirely here, and try to do what you want with synonyms.


Multi-word synonyms can be tricky, I'm not entirely sure the right way 
to do it for this use case. But I think the synonym feature is what you 
want. Not the stopwords feature.
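
For example, something along these lines in synonyms.txt, applied at index 
time, is roughly what I have in mind -- untested for your case, and the 
multi-word left-hand side is exactly the tricky part:

    ring tone, ringer tone => ringtone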




On 10/28/13 12:24 PM, Developer wrote:

Thanks for your response Eric. Sorry for the confusion.

I currently display both 'ring tone' as well as 'ringtone' when the user
types in 'r' but I am trying to figure out a way to display just 'ringtone'
hence I added 'ring tone' to stopwords list so that it doesn't get indexed.

I have the list of know keywords (more like synonyms) which I am trying to
map against the user entered keywords.

ring tone, ringer tone = ringtone








Re: difference between apache tomcat vs Jetty

2013-10-24 Thread Jonathan Rochkind
This is good to know, and I find it welcome advice; I would recommend 
making sure this advice is clearly highlighted in the relevant Solr 
docs, such as any getting started docs.


I'm not sure everyone realizes this, and some go down the tomcat route 
without realizing the Solr committers recommend jetty -- or use a stock 
jetty without realizing the 'example' jetty is recommended and actually 
intended to be used by Solr users in production!  I think it's easy to 
miss this advice.


On 10/20/13 5:55 PM, Shawn Heisey wrote:

On 10/20/2013 2:57 PM, Shawn Heisey wrote:

We recommend jetty.  The solr example uses jetty.


I have a clarification for this statement.  We actually recommend using
the jetty that's included in the Solr 4.x example.  It is stripped of
all unnecessary features and its config has had some minor tuning so
it's optimized for Solr.  The jetty binaries in 4.x are completely
unmodified from the upstream download, we just don't include all of
them.  On the 1.x and 3.x examples, there was a small bug in Jetty 6, so
those versions included modified binaries.

If you download jetty from eclipse.org or install it from your operating
system's repository, it will include components you don't need and its
config won't be optimized for Solr, but it will still be a lot closer to
what's actually tested than tomcat is.

Thanks,
Shawn



solr 4.3, autocommit, maxdocs

2013-07-15 Thread Jonathan Rochkind
I have a solr 4.3 instance I am in the process of standing up. It 
started out with an empty index.


I have in its solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

I have an index process running, that has currently added around 400k 
documents to Solr.


I had expected that a 'commit' would be run every 100k documents, from 
the above configuration, so 4 commits would have been run by now, and 
I'd see documents in the index.


However, when I look in the Solr admin interface, at my core's 
'overview' page, it still says num docs 0, segment count 0, when I 
expected num docs 400k at this point.


Is there something I'm misunderstanding about the configuration or the 
admin interface? Or am I right in my expectations, but something else 
must be going wrong?


Thanks for any advice,

Jonathan


Re: solr 4.3, autocommit, maxdocs

2013-07-15 Thread Jonathan Rochkind
Ah, thanks for this explanation. Although I don't entirely understand 
it, I am glad there is an expected explanation!


This Solr instance is actually set up to be a replication master. It 
never gets searched itself, it just replicates to slaves that get searched.


Perhaps some time in the past (I am migrating from an already set up 
Solr 1.4 instance), I set this value to false, figuring it was not 
necessary to actually open a searcher, since the master does not get 
searched itself ordinarily.


Despite the opensearcher=false... once committed, are the committed docs 
still going to be sent via replication to a slave, is the index used for 
replication actually changed, even though a searcher hasn't been opened 
to take account of it?  Or will the opensearcher=false keep the commits 
from being seen by replication slaves too?


Thanks for any tips,

Jonathan

On 7/15/13 12:57 PM, Jason Hellman wrote:

Jonathan,

Please note the openSearcher=false part of your configuration.  This is why you 
don't see documents.  The commits are occurring, and being written to segments 
on disk, but they are not visible to the search engine because a Solr searcher 
class has not opened them for visibility.

You can either change the value to true, or alternatively call a deterministic 
commit call at the end of your load (a solr/update?commit=true will default to 
openSearcher=true).

Hope that's of use!

Jason


On Jul 15, 2013, at 9:52 AM, Jonathan Rochkind rochk...@jhu.edu wrote:


I have a solr 4.3 instance I am in the process of standing up. It started out 
with an empty index.

I have in its solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

I have an index process running, that has currently added around 400k documents 
to Solr.

I had expected that a 'commit' would be run every 100k documents, from the 
above configuration, so 4 commits would have been run by now, and I'd see 
documents in the index.

However, when I look in the Solr admin interface, at my core's 'overview' page, 
it still says num docs 0, segment count 0.  When I expected num docs 400k at 
this point.

Is there something I'm misunderstanding about the configuration or the admin 
interface? Or am I right in my expectations, but something else must be going 
wrong?

Thanks for any advice,

Jonathan




SolrJ and initializing logger in solr 4.3?

2013-07-11 Thread Jonathan Rochkind

I am using SolrJ in a Java (actually jruby) project, with Solr 4.3.

When I instantiate an HttpSolrServer, I get the dreaded:

log4j:WARN No appenders could be found for logger 
(org.apache.solr.client.solrj.impl.HttpClientUtil).

log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for 
more info.



Using SolrJ as an embedded library in my own software, what is the 
proper or 'best practice' way -- or failing that, just any way at all -- 
to initialize log4j under Solr 4.3?


I am not super familiar with Java or log4j; hopefully there is an easy 
way to do this?
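
From what I've read, the usual quick fix is a minimal log4j.properties 
somewhere on the classpath, something like the following -- though whether 
that counts as 'best practice' is exactly what I'm asking about:

    log4j.rootLogger=WARN, stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %c - %m%n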


(If someone has a way especially suited for jruby, even better; but just 
a standard Java answer would be great too.)


Thanks for any advice!


SolrJ 4.3 to Solr 1.4

2013-07-11 Thread Jonathan Rochkind
So, trying to use a SolrJ 4.3 to talk to an old Solr 1.4. Specifically 
to add documents.


The wiki at http://wiki.apache.org/solr/Solrj suggests, I think, that 
this should work, so long as you:


server.setParser(new XMLResponseParser());

However, when I do this, I still get a 
org.apache.solr.common.SolrException: parsing error from 
org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:143)


(If I _don't_ setParser to XML, and use the binary parser... I get a 
fully expected error about binary format corruption -- that part is 
expected and I understand it, that's why you have to use the 
XMLResponseParser instead).


Am I not doing enough to my SolrJ 4.3 to get it to talk to the Solr 1.4 
server in pure XML? I've set the parser to the XMLResponseParser, do I 
also have to somehow tell it to actually use the Solr 1.4 XML update 
handler or something?  I don't entirely understand what I'm talking about.


Alternately... is it just a lost cause trying to get SolrJ 4.3 to talk 
to Solr 1.4, is the wiki wrong that this is possible?


Thanks for any help,

Jonathan


Re: SolrJ 4.3 to Solr 1.4

2013-07-11 Thread Jonathan Rochkind

Huh, that might have been a false problem of some kind.

At the moment, it looks like I _do_ have my SolrJ 4.3 succesfully 
talking to a Solr 1.4, so long as I setParser(new XMLResponseParser()).


Not sure what I changed or what wasn't working before, but great!

So nevermind. Although if anyone reading this wants to share any other 
potential gotchas on solrj 4.3 talking to solr 1.4, feel free!
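
For reference, the minimal setup that appears to be working for me looks 
roughly like this (URL and field values are just placeholders):

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");
    server.setParser(new XMLResponseParser());  // XML responses; the 1.4 javabin format isn't compatible

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "test-1");
    server.add(doc);
    server.commit();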


On 7/11/13 4:24 PM, Jonathan Rochkind wrote:

So, trying to use a SolrJ 4.3 to talk to an old Solr 1.4. Specifically
to add documents.

The wiki at http://wiki.apache.org/solr/Solrj suggests, I think, that
this should work, so long as you:

server.setParser(new XMLResponseParser());

However, when I do this, I still get a
org.apache.solr.common.SolrException: parsing error from
org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:143)


(If I _don't_ setParser to XML, and use the binary parser... I get a
fully expected error about binary format corruption -- that part is
expected and I understand it, that's why you have to use the
XMLResponseParser instead).

Am I not doing enough to my SolrJ 4.3 to get it to talk to the Solr 1.4
server in pure XML? I've set the parser to the XMLResponseParser, do I
also have to somehow tell it to actually use the Solr 1.4 XML update
handler or something?  I don't entirely understand what I'm talking about.

Alternately... is it just a lost cause trying to get SolrJ 4.3 to talk
to Solr 1.4, is the wiki wrong that this is possible?

Thanks for any help,

Jonathan


Solr, ICUTokenizer with Latin-break-only-on-whitespace

2013-06-20 Thread Jonathan Rochkind

(to solr-user, CC'ing author I'm responding to)

I found the solr-user listserv contribution at:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E

Which explain a way you can supply custom rulefiles to ICUTokenizer, in 
this case to tell it to only break on whitespace for Latin character 
substrings.


I am trying to use the technique explained there in Solr 4.3, but either 
it's not working, or it's not doing what I'd expect.


I want, for instance, "C++ Language" to be tokenized into "C++", 
"Language". I am passing rulefiles=Latn:Latin-break-only-on-whitespace.rbbi, 
with the rbbi file from the Solr 4.3 source [1].


But the ICUTokenizer, even with that rulefile, is still stripping 
the punctuation, and tokenizing that into "C", "Language".


Can anyone give me any guidance or hints? I don't entirely understand 
the semantics of the rbbi file to try debugging there. Is something not 
working, or does the rbbi file just not express the semantics I want?
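
For reference, the way I'm invoking it (with the rbbi file sitting in the 
core's conf directory) is:

    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>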


Thanks for any tips.



[1] 
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_3_0/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi?revision=1479557view=markup




Re: Solr, ICUTokenizer with Latin-break-only-on-whitespace

2013-06-20 Thread Jonathan Rochkind
Thank you... I started out writing an email with screenshots proving 
that it wasn't working for me in 4.3.0... and of course, having to 
confirm every single detail in order to say I confirmed it... I realized 
it was a mistake on my part, not testing what I thought I was testing.


Does indeed appear to be working now. Thanks! And thanks for this feature.


On 6/20/2013 3:40 PM, Shawn Heisey wrote:

On 6/20/2013 1:26 PM, Jonathan Rochkind wrote:

I want, for instance, "C++ Language" to be tokenized into "C++",
"Language". I am passing rulefiles=Latn:Latin-break-only-on-whitespace.rbbi,
with the rbbi file from the Solr 4.3 source [1].

But the ICUTokenizer, even with that rulefile, is still stripping
the punctuation, and tokenizing that into "C", "Language".


This screenshot is using branch_4x downloaded and compiled a couple of
hours ago, with the rbbi file you mentioned copied to the conf directory:

https://dl.dropboxusercontent.com/u/97770508/icutokenizer-whitespace-only.png


It shows that the ++ is maintained by the ICU tokenizer. It also
illustrates a UI bug that I will have to show to steffkes where the ++
is lost from the input field after analysis.

Thanks,
Shawn



Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind

I am trying to get Solr installed in Tomcat, and having trouble.

I am trying to use the instructions at 
http://wiki.apache.org/solr/SolrTomcat as a guide, trying to start with 
the example Solr from the Solr distro. I tried with both a binary 
distro's existing solr.war, and with compiling my own solr.war.


* Solr 4.3.0
* Tomcat 6.0.29
* JVM 1.6

When I start up tomcat, I get in the Tomcat log:


INFO: Deploying web application archive solr.war
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors


And solr is not actually deployed, naturally.

I've tried to google for advice on this -- mostly what I found was 
suggestions for how to turn up logging to get more info (maybe a stack 
trace?) to give you more clues about what's failing -- but nothing I 
found actually succeeded in turning up the logging.


So I'm at a bit of a loss. Any suggestions? Any ideas what might be 
causing this error, and/or how to get more information on what's causing it?


Re: Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind
Thanks! I guess I should have asked on-list BEFORE wasting 4 hours 
fighting with it myself, but I was trying to be a good user and do my 
homework!  Oh well.


Off to the logging instructions, hope I can figure them out -- if you 
could update the tomcat instructions with the simplest possible way to 
get deployment in Tomcat to work, that'd def be helpful!


On 5/30/2013 10:41 AM, Shawn Heisey wrote:

I am trying to get Solr installed in Tomcat, and having trouble.




When I start up tomcat, I get in the Tomcat log:


INFO: Deploying web application archive solr.war
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors




I've tried to google for advice on this -- mostly what I found was
suggestions for how to turn up logging to get more info


In a cruel twist of fate, it is actually logging changes that are
preventing Solr from starting. The required steps for deploying 4.3
changed. I will update the wiki page about tomcat when I'm not on a train.
  See this page for additional instructions, specifically the section about
deploying on containers other than jetty:

http://wiki.apache.org/solr/SolrLogging

Thanks,
Shawn





Re: Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind
I'm going to add a note to http://wiki.apache.org/solr/SolrLogging , 
with the Tomcat sample Error filterStart error, as an example of 
something you might see if you have not set up logging.


Then at least in the future, googling solr tomcat error filterStart 
might lead someone to the clue that it might be logging.



On 5/30/2013 10:41 AM, Shawn Heisey wrote:

I am trying to get Solr installed in Tomcat, and having trouble.




When I start up tomcat, I get in the Tomcat log:


INFO: Deploying web application archive solr.war
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors




I've tried to google for advice on this -- mostly what I found was
suggestions for how to turn up logging to get more info


In a cruel twist of fate, it is actually logging changes that are
preventing Solr from starting. The required steps for deploying 4.3
changed. I will update the wiki page about tomcat when I'm not on a train.
  See this page for additional instructions, specifically the section about
deploying on containers other than jetty:

http://wiki.apache.org/solr/SolrLogging

Thanks,
Shawn





Re: Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind

Okay, sadly, i still can't get this to work.

Following the instructions at:
https://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty

I copied solr/example/lib/ext/*.jar into my tomcat's ./lib, and copied 
solr/example/resources/log4j.properties there too.


The result is unchanged, when I start tomcat, it still says:

May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors


This is very frustrating. I have no way to even be sure this problem 
really is logging related, although it seems likely. But I feel like I'm 
just randomly moving chairs around and hoping the error will go away, 
and it does not.


Is there anyone that has successfully run Solr 4.3.0 in a Tomcat 6? Can 
we even confirm this is possible?  Can anyone give me any other hints -- 
especially, does anyone have any idea how to get some more logging out of 
Tomcat than the fairly useless "Error filterStart"?


The only reason I'm using tomcat is that we always have in our current 
Solr 1.4-based application, for reasons lost to time. I was hoping to 
upgrade to Solr 4.3, without simultaneously switching our infrastructure 
from tomcat to jetty, change one thing at a time. I suppose I might need 
to abandon that and switch to jetty too, but I'd rather not.


Re: Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind
Okay, for posterity: I did manage to get it working. It WAS lack of the 
logging files.


First, the only way I could manage to get Tomcat6 to log an actual 
stacktrace for the Error filterStart was to _delete_ my 
CATALINA_HOME/conf/logging.properties file.  Apparently without this 
file at all, the default ends up being 'log everything'.


And once that happened, it did confirm that the Error filterStart 
problem WAS an inability to find the logging jars. (And the stack trace 
was an exception from Solr with a nice message including the URL to the 
logging wiki page, nice one solr). Nothing I tried before deleting that 
file entirely, in a fit of desperation, worked to get the stack trace 
logged.


Once confirmed that the problem really was not finding the logging jars, 
I could keep doing things and restarting and seeing if that was still 
the exception.


And I found that for some reason, despite 
http://tomcat.apache.org/tomcat-6.0-doc/class-loader-howto.html 
suggesting that jars could be found in either CATALINA_BASE/lib (for me 
/opt/tomcat6/lib) OR CATALINA_HOME/lib (for me /usr/share/tomcat6/lib), 
in fact for whatever reason /opt/tomcat6/lib was being ignored, but 
/usr/share/tomcat6/lib worked.


And now I successfully have solr started in tomcat.

I realize that these are all tomcat6 issues, not solr issues. But others 
trying to get solr started may have similar problems. Appreciate the tip 
that the Error filterStart was probably related to new solr 4.3.0 
logging setup, which ended up confirmed.


Jonathan

On 5/30/2013 3:19 PM, Jonathan Rochkind wrote:

Okay, sadly, i still can't get this to work.

Following the instructions at:
https://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty


I copied solr/example/lib/ext/*.jar into my tomcat's ./lib, and copied
solr/example/resources/log4j.properties there too.

The result is unchanged, when I start tomcat, it still says:

May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors


This is very frustrating. I have no way to even be sure this problem
really is logging related, although it seems likely. But I feel like I'm
just randomly moving chairs around and hoping the error will go away,
and it does not.

Is there anyone that has successfully run Solr 4.3.0 in a Tomcat 6? Can
we even confirm this is possible?  Can anyone give me any other hints --
especially, does anyone have any idea how to get some more logging out of
Tomcat than the fairly useless "Error filterStart"?

The only reason I'm using tomcat is that we always have in our current
Solr 1.4-based application, for reasons lost to time. I was hoping to
upgrade to Solr 4.3, without simultaneously switching our infrastructure
from tomcat to jetty, change one thing at a time. I suppose I might need
to abandon that and switch to jetty too, but I'd rather not.


replication without automated polling, just manual trigger?

2013-05-15 Thread Jonathan Rochkind
I want to set up Solr replication between a master and slave, where no 
automatic polling every X minutes happens, instead the slave only 
replicates on command. [1]


So the basic question is: What's the best way to do that? But I'll 
provide what I've been doing etc., for anyone interested.


Until recently, my appliation was running on Solr 1.4.  I had a setup 
that was working to accomplish this in Solr 1.4, but as I work on moving 
it to Solr 4.3, it's unclear to me if it can/will work the same way.


In Solr 1.4, on the slave, I supplied a masterUrl, but did NOT supply any 
pollInterval at all on the slave.  I did NOT set the slave's "enable" 
option to false, because I think that would have prevented even manual 
replication.


This seemed to result in the slave never polling, although I'm not sure 
if that was just an accident of Solr implementation or not.  Can anyone 
say if the same thing would happen in Solr 4.3?  If I look at the admin 
screen for my slave set up this way in Solr 4.3, it does say "polling 
enabled", but I realize that doesn't necessarily mean any polling will 
take place, since I've set no pollInterval.


In Solr 1.4 under this setup, I could go to the slave's 
admin/replication, and there was a replicate now button that I could 
use for manually triggered replication.  This button seems to no longer 
be there in 4.3 replication admin screen, although I suppose I could 
still, somewhat less conveniently, issue a 
`replication?command=fetchindex` to the slave, to manually trigger a 
replication?
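
(Concretely, the sort of slave setup and manual trigger I mean -- hostnames
made up -- is roughly:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master.example.org:8983/solr/replication</str>
    <!-- no pollInterval supplied, hoping that means no automatic polling -->
  </lst>
</requestHandler>

and then, when I actually want a replication:

http://slave.example.org:8983/solr/replication?command=fetchindex
)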




Thanks for any advice or ideas.



[1]: Why, you ask?  The master is actually my 'indexing' server. Due to 
business needs, indexing only happens in bulk/mass indexing, and only 
happens periodically -- sometimes nightly, sometimes less. So I index on 
master, at a periodic schedule, and then when indexing is complete and 
verified, tell slave to replicate.  I don't want slave accidentally 
replicating in the middle of the bulk indexing process either, when the 
index might be in an unfinished state.


writing a custom Filter plugin?

2013-05-13 Thread Jonathan Rochkind
Does anyone know of any tutorials, basic examples, and/or documentation 
on writing your own Filter plugin for Solr? For Solr 4.x/4.3?


I would like a Solr 4.3 version of the normalization filters found here 
for Solr 1.4: https://github.com/billdueber/lib.umich.edu-solr-stuff


But those are old, for Solr 1.4.

Does anyone have any hints for writing a simple substitution Filter for 
Solr 4.x?  Or, does a simple sourcecode example exist anywhere?
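
(For anyone else looking: a bare-bones sketch of the shape such a plugin takes --
not the umich filters themselves, and with made-up package/class names. It's
written against the analysis API roughly as it stood in 4.3; later 4.x releases
pass the args Map through the factory constructor rather than init().

// SimpleSubstitutionFilter.java -- rewrites one literal term to another,
// a stand-in for a real normalization rule.
package org.example.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SimpleSubstitutionFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public SimpleSubstitutionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;                         // no more tokens from upstream
    }
    if ("colour".equals(termAtt.toString())) {
      termAtt.setEmpty().append("color");   // rewrite the current token in place
    }
    return true;
  }
}

// SimpleSubstitutionFilterFactory.java -- so schema.xml can reference it.
package org.example.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class SimpleSubstitutionFilterFactory extends TokenFilterFactory {
  @Override
  public TokenStream create(TokenStream input) {
    return new SimpleSubstitutionFilter(input);
  }
}

Compile against the matching lucene-core and lucene-analyzers-common jars, put
the resulting jar somewhere the core loads libs from, and reference it in an
analyzer chain as <filter class="org.example.analysis.SimpleSubstitutionFilterFactory"/>.)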


Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Jonathan Rochkind
When I do things like this and want to avoid empty tokens even though 
previous analysis might result in some--I just throw one of these at the 
end of my analysis chain:


<!-- get rid of empty string tokens. max is required, although
     we don't really care. -->
<filter class="solr.LengthFilterFactory" min="1" max=""/>

A charfilter to filter raw characters can certainly still result in an 
empty token, if an initial token was composed solely of chars you wanted 
to filter out!  In which case you probably want the token to be deleted 
entirely, not still there as an empty token. The above length filter is 
one way to do that, although unfortunately requires specifying a 'max' 
even though I didn't actually want to filter out on the high end, oh well.
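
(To make that concrete, a rough sketch of the kind of chain I mean -- the field
type name, the pattern, and the max value are all just illustrative:

<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip the unwanted characters before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[\p{Punct}]" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- drop any tokens that ended up empty after the char filter -->
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
</fieldType>
)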



On 9/24/2012 1:07 PM, Jack Krupansky wrote:

I tried it and PRFF is indeed generating an empty token. I don't know
how Lucene will index or query an empty term. I mean, what it should
do. In any case, it is best to avoid them.

You should be using a charFilter to simply filter raw characters
before tokenizing. So, try:

<charFilter class="solr.PatternReplaceCharFilterFactory"/>

It has the same pattern and replacement attributes.

-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Monday, September 24, 2012 12:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr - Remove specific punctuation marks

1. Which query parser are you using?
2. I see the following comment in the Java 6 doc for regex \p{Punct}:
POSIX character classes (US-ASCII only), so if any of the punctuation is
some higher Unicode character code, it won't be matched/removed.
3. It seems very odd that the parsed query has empty terms - normally the
query parsers will ignore terms that analyze to zero tokens. Maybe your {
is not an ASCII left brace code and is (apparently) unprintable in the
parsed query. Or, maybe there is some encoding problem in the analyzer.

-- Jack Krupansky

-Original Message- From: Daisy
Sent: Monday, September 24, 2012 9:26 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr - Remove specific punctuation marks

I tried &amp; and it solved the 500 error code. But still it could find
punctuation marks.
Although the parsed query didn't contain the punctuation mark,

<str name="rawquerystring">{</str>
<str name="querystring">{</str>
<str name="parsedquery">text:</str>
<str name="parsedquery_toString">text:</str>

but still the numfound gives 1

<result name="response" numFound="1" start="0">

and the highlight shows the result of punctuation mark
<em>{</em>
The steps I did:
1- editing the schema
2- restart the server
3- delete the file
4- index the file




--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-Remove-specific-punctuation-marks-tp4009795p4009835.html

Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to exactly match fields which are multi-valued?

2012-03-08 Thread Jonathan Rochkind
Well, if you really want EXACT exact, just use a KeywordTokenizer (ie, 
not tokenize at all). But then matches will really have to be EXACT, 
including punctuation, whitespace, diacritics, etc.  But a query will 
only match if it 'exactly' matches one value in your multi-valued field.


You could try a KeywordTokenizer with some normalization too.

Either way, though, if you're issuing a query to a field tokenized with 
KeywordTokenizer that can include whitespace in its values, you really 
need to issue it as a _phrase query_, to avoid being messed up by the 
lucene or dismax query parser's pre-tokenization.  Which is 
potentially fine, that's what you want to do anyway for 'exact match'.  
Except if you wanted to use dismax multiple qf's with just a BOOST on 
the 'exact match', but _not_ a phrase query for other fields... well, I 
can't figure out any way to do it with this technique.


It gets tricky, I haven't found a great solution.
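
(For the "KeywordTokenizer with some normalization" idea, a rough sketch --
names and the exact normalization filters are just illustrative:

<fieldType name="exactish" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- whole field value becomes a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

And then query it as a phrase, e.g. q=myfield:"large cat", so the query parser
doesn't split on the whitespace before your analyzer sees it.)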

On 3/8/2012 7:44 AM, Erick Erickson wrote:

You haven't really given us much to go on here. Matches
are just like a single valued field with the exception of
the increment gap. Say one entry were
large cat big dog
in a multi-valued field. Say the next document
indexed two values,
large cat
big dog

And, say the increment gap were 100. The token offsets
for doc 1 would be
0, 1, 2, 3
and for doc 2 would be
0, 1, 101, 102

The only effective difference is that phrase queries with slop
less than 100 would NEVER match across multi-values. I.e.
"cat big"~10 would match doc1 but not doc 2

Best
Erick

2012/3/7 SuoNayisuonayi2...@163.com:

Hi all, how to offer exact-match capabilities on the multi-valued fields?

Any helps are appreciated!

SuoNayi


Re: need to support bi-directional synonyms

2012-02-23 Thread Jonathan Rochkind

Honestly, I'd just map 'em both to the same thing in the index.

sprayer, washer => sprayer

or

sprayer, washer => sprayer_washer

At both index and query time. Now if the source document includes either 
'sprayer' or 'washer', it'll get indexed as 'sprayer_washer'.  And if 
the user enters either 'sprayer' or 'washer', it'll search the index for 
'sprayer_washer', and find source documents that included either 
'sprayer' or 'washer'.


Of course, if you really use sprayer_washer, then if the user actually 
enters sprayer_washer they'll also find sprayer, washer, and 
sprayer_washer.


So it's probably best to actually use either 'sprayer' or 'washer' as 
the destination, even though it seems odd:


sprayer, washer => washer

Will do what you want, pretty sure.
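
(That assumes the usual wiring, something like the following in both the index
and query analyzers, with the mapping above in synonyms.txt:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
)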

On 2/23/2012 1:03 AM, remi tassing wrote:

Same question here...

On Wednesday, February 22, 2012, geeky2gee...@hotmail.com  wrote:

hello all,

i need to support the following:

if the user enters sprayer in the desc field - then they get results for
BOTH sprayer and washer.

and in the other direction

if the user enters washer in the desc field - then they get results for
BOTH washer and sprayer.

would i set up my synonym file like this?

assuming expand = true..

sprayer =>  washer
washer =>  sprayer

thank you,
mark

--
View this message in context:

http://lucene.472066.n3.nabble.com/need-to-support-bi-directional-synonyms-tp3767990p3767990.html

Sent from the Solr - User mailing list archive at Nabble.com.



Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-22 Thread Jonathan Rochkind
So I don't really know what I'm talking about, and I'm not really sure 
if it's related or not, but your particular query:


The Beatles as musicians : Revolver through the Anthology

With the lone word that's a ':', it reminds me of a dismax stopwords-type 
problem I ran into. Now, I ran into it on 1.4.  I don't know why it 
would be different on 1.4 and 3.x. And I see you aren't even using a 
multi-field dismax in your sample query, so it couldn't possibly be what 
I ran into... I don't think. But I'll write this anyway in case it gives 
someone some ideas.


The problem I ran into is caused by different analysis in two fields 
both used in a dismax, one that ends up keeping : as a token, and one 
that doesn't.  Which ends up having the same effect as the famous 
'dismax stopwords problem'.


Maybe somehow your schema changed such to produce this problem in 3.x 
but not in 1.4? Although again I realize the fact that you are only 
using a single field in your demo dismax query kind of suggests it's not 
this problem. Wonder if you try the query without the :, if the 
problem goes away, that might be a hint. Or, maybe someone more skilled 
at understanding what's in those Solr debug statements than I am (it's 
kind of all greek to me) will be able to take this hint and rule out or 
confirm that it may have something to do with your problem.


Here I write up the issue I ran into (which may or may not have anything 
to do with what you ran into)


http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/


Also, you don't say what your 'mm' is in your dismax queries, that could 
be relevant if it's got anything to do with anything similar to the 
issue I'm talking about.


Hmm, I wonder if Solr 3.x changes the way dismax calculates number of 
tokens for 'mm' in such a way that the 'varying field analysis dismax 
gotcha' can manifest with only one field, if the way dismax counts 
tokens for 'mm' differs from number of tokens the single field's 
analysis produces?


Jonathan

On 2/22/2012 2:55 PM, Naomi Dushay wrote:

I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem.   I 
have a test checking for a search result in Solr, and the test passes in Solr 
1.4, but fails in Solr 3.5.   Dismax is the desired QueryParser -- I just 
included output from lucene QueryParser to prove the document exists and is 
found

I am completely stumped.


Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser:

URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query:  all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:the beatl as musician revolv through the 
antholog in 1064395), product of:
   1.0 = queryWeight(all_search:the beatl as musician revolv through the 
antholog), product of:
 48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 
revolv=872 through=81366 the=3531140 antholog=11611)
 0.02063975 = queryNorm
   6.0562754 = fieldWeight(all_search:the beatl as musician revolv through the 
antholog in 1064395), product of:
 1.0 = tf(phraseFreq=1.0)
 48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 
revolv=872 through=81366 the=3531140 antholog=11611)
 0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:
URL:  qf=all_search&pf=all_search&q=The Beatles as musicians : Revolver through the 
Anthology
final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 
(all_search:"the beatl as musician revolv through the antholog"~3)~0.01

(no matches)


***Solr 1.4***

lucene QueryParser:

URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query:  all_search:"the beatl as musician revolv through the antholog"

5.2676983 = fieldWeight(all_search:the beatl as musician revolv through the 
antholog in 3469163), product of:
   1.0 = tf(phraseFreq=1.0)
   48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 
revolv=820 through=88238 the=3542123 antholog=11205)
   0.109375 = fieldNorm(field=all_search, doc=3469163)

dismax QueryParser:
URL:  qf=all_search&pf=all_search&q=The Beatles as musicians : Revolver through the 
Anthology
final query:  +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 
(all_search:"the beatl as musician revolv through the antholog"~3)~0.01

score:

7.449651 = (MATCH) sum of:
   3.7248254 = weight(all_search:the beatl as musician revolv through the 
antholog~1 in 3469163), product of:
 0.7071068 = queryWeight(all_search:the beatl as musician revolv through the 
antholog~1), product of:
   48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 
musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
   0.014681898 = queryNorm
 5.2676983 = fieldWeight(all_search:the beatl as musician revolv through the 
antholog in 3469163), product of:
   1.0 = tf(phraseFreq=1.0)

Re: replication, disk space

2012-01-19 Thread Jonathan Rochkind

Thanks for the response. I am using Linux (RedHat).

It sounds like it may possibly be related to that bug.

But the thing is, the timestamped index directory is looking to me like 
it's the _current_ one, with the non-timestamped one being an old out of 
date one.  So that does not seem to be quite the same thing reported in 
that bug, although it may very well be related.


At this point, I'm just trying to figure out how to clean up.  How to 
verify which of those copies really is the current one, which is 
currently being used by Solr -- and if it's the timestamped one, how to 
restore things to the state where there's only one non-timestamped index 
dir, ideally without downtime to Solr.


Anyone have any advice or ideas on those questions?

On 1/18/2012 1:23 PM, Artem Lokotosh wrote:

Which OS do you using?
Maybe related to this Solr bug
https://issues.apache.org/jira/browse/SOLR-1781

On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:

So Solr 1.4. I have a solr master/slave, where it actually doesn't poll for
replication, it only replicates irregularly when I issue a replicate command
to it.

After the last replication, the slave, in solr_home, has a data/index
directory as well as a data/index.20120113121302 directory.

The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index. file is actually the one currently being
used live, not the straight 'index'? Why?

I can't afford the disk space to leave both of these around indefinitely.
  After replication completes and is committed, why would two index dirs be
left?  And how can I restore this to one index dir, without downtime? If
it's really using the index.X directory, then I could just delete the
index directory, but that's a bad idea, because next time the server
starts it's going to be looking for index, not index..  And if it's
using the timestamped index file now, I can't delete THAT one now either.

If I was willing to restart the tomcat container, then I could delete one,
rename the other, etc. But I don't want downtime.

I really don't understand what's going on or how it got in this state. Any
ideas?

Jonathan






Re: replication, disk space

2012-01-19 Thread Jonathan Rochkind
Hmm, I don't have a replication.properties file, I don't think. Oh 
wait, yes I do there it is!  I guess the replication process makes this 
file?


Okay

I don't see an index directory in the replication.properties file at all 
though. Below is my complete replication.properties.


So I'm still not sure how to properly recover from this situation 
without downtime. It _looks_ to me like the timestamped directory is 
actually the live/recent one.  Its files have a more recent timestamp, 
and it's the one that /admin/replication.jsp mentions.


replication.properties:

#Replication details
#Wed Jan 18 10:58:25 EST 2012
confFilesReplicated=[solrconfig.xml, schema.xml]
timesIndexReplicated=350
lastCycleBytesDownloaded=6524299012
replicationFailedAtList=1326902305288,1326406990614,1326394654410,1326218508294,1322150197956,1321987735253,1316104240679,1314371534794,1306764945741,1306678853902
replicationFailedAt=1326902305288
timesConfigReplicated=1
indexReplicatedAtList=1326902305288,1326825419865,1326744428192,1326645554344,1326569088373,1326475488777,1326406990614,1326394654410,1326303313747,1326218508294
confFilesReplicatedAt=1316547200637
previousCycleTimeInSeconds=295
timesFailed=54
indexReplicatedAt=1326902305288
~


On 1/18/2012 1:41 PM, Dyer, James wrote:

I've seen this happen when the configuration files change on the master and replication deems it necessary to 
do a core-reload on the slave. In this case, replication copies the entire index to the new directory then 
does a core re-load to make the new config files and new index directory go live.  Because it is keeping the 
old searcher running while the new searcher is being started, both index copies exist until the swap is 
complete.  I remember having the same concern about re-starts, but I believe I tested this and solr will look 
at the replication.properties file on startup and determine the correct index dir to use from 
that.  So (If my memory is correct) you can safely delete index so long as 
replication.properties points to the other directory.

I wasn't familiar with SOLR-1781.  Maybe replication is supposed to clean up the extra 
directories and doesn't sometimes?  In any case, I've found whenever it happens its ok to 
go out and delete the one(s) not being used, even if that means deleting 
index.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Artem Lokotosh [mailto:arco...@gmail.com]
Sent: Wednesday, January 18, 2012 12:24 PM
To: solr-user@lucene.apache.org
Subject: Re: replication, disk space

Which OS do you using?
Maybe related to this Solr bug
https://issues.apache.org/jira/browse/SOLR-1781

On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:

So Solr 1.4. I have a solr master/slave, where it actually doesn't poll for
replication, it only replicates irregularly when I issue a replicate command
to it.

After the last replication, the slave, in solr_home, has a data/index
directory as well as a data/index.20120113121302 directory.

The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index. file is actually the one currently being
used live, not the straight 'index'? Why?

I can't afford the disk space to leave both of these around indefinitely.
  After replication completes and is committed, why would two index dirs be
left?  And how can I restore this to one index dir, without downtime? If
it's really using the index.X directory, then I could just delete the
index directory, but that's a bad idea, because next time the server
starts it's going to be looking for index, not index..  And if it's
using the timestamped index file now, I can't delete THAT one now either.

If I was willing to restart the tomcat container, then I could delete one,
rename the other, etc. But I don't want downtime.

I really don't understand what's going on or how it got in this state. Any
ideas?

Jonathan






Re: replication, disk space

2012-01-19 Thread Jonathan Rochkind

On 1/18/2012 1:53 PM, Tomás Fernández Löbbe wrote:

As far as I know, the replication is supposed to delete the old directory
index. However, the initial question is why is this new index directory
being created. Are you adding/updating documents in the slave? what about
optimizing it? Are you rebuilding the index from scratch in the master?


Thanks for the response. Not adding/updating in slave. Not optimizing in 
slave. YES sometimes rebuilding index from scratch in master.


I am on Linux, RedHat 5.

This server has also occasionally been having out-of-disk problems, 
which caused some replications to fail; an aborted replication could 
also possibly account for the extra index directory, perhaps? (It now 
has enough disk space to avoid that problem).


At this point, my main concern is getting things back into an expected 
stable state, eliminating the extra index dir, ideally 
without downtime.


Re: replication, disk space

2012-01-19 Thread Jonathan Rochkind
Okay, I do have an index.properties file too, and THAT one does contain 
the name of an index directory.


But it's got the name of the timestamped index directory!  Not sure how 
that happened, could have been Solr trying to recover from running out 
of disk space in the middle of a replication? I certainly never did that 
intentionally.


But okay, if someone can confirm if this plan makes sense to restore 
things without downtime:


1. rm the 'index' directory, which seems to be an old copy of the index 
at this point

2. 'mv index.20120113121302 index'
3. Manually edit index.properties to have index=index, not 
index=index.20120113121302

4. Send reload core command.
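
(By "reload core command" I mean the CoreAdmin RELOAD call, something like the
following -- host and core name made up:

http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycore
)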

Does this make sense?  (I just experimentally tried a reload core 
command, and even though it's not supposed to, it DID result in about 20 
seconds of unresponsiveness from my solr server, not sure why, could 
just be lack of CPU or RAM on the server to do what's being asked of it. 
But if that's the best I can do, 20 seconds of unavailability, I'll take 
it.)


On 1/19/2012 12:37 PM, Jonathan Rochkind wrote:
Hmm, I don't have a replication.properties file, I don't think. Oh 
wait, yes I do there it is!  I guess the replication process makes 
this file?


Okay

I don't see an index directory in the replication.properties file at 
all though. Below is my complete replication.properties.


So I'm still not sure how to properly recover from this situation 
without downtime. It _looks_ to me like the timestamped directory is 
actually the live/recent one.  Its files have a more recent 
timestamp, and it's the one that /admin/replication.jsp mentions.


replication.properties:

#Replication details
#Wed Jan 18 10:58:25 EST 2012
confFilesReplicated=[solrconfig.xml, schema.xml]
timesIndexReplicated=350
lastCycleBytesDownloaded=6524299012
replicationFailedAtList=1326902305288,1326406990614,1326394654410,1326218508294,1322150197956,1321987735253,1316104240679,1314371534794,1306764945741,1306678853902 


replicationFailedAt=1326902305288
timesConfigReplicated=1
indexReplicatedAtList=1326902305288,1326825419865,1326744428192,1326645554344,1326569088373,1326475488777,1326406990614,1326394654410,1326303313747,1326218508294 


confFilesReplicatedAt=1316547200637
previousCycleTimeInSeconds=295
timesFailed=54
indexReplicatedAt=1326902305288
~


On 1/18/2012 1:41 PM, Dyer, James wrote:
I've seen this happen when the configuration files change on the 
master and replication deems it necessary to do a core-reload on the 
slave. In this case, replication copies the entire index to the new 
directory then does a core re-load to make the new config files and 
new index directory go live.  Because it is keeping the old searcher 
running while the new searcher is being started, both index copies 
exist until the swap is complete.  I remember having the same concern 
about re-starts, but I believe I tested this and solr will look at 
the replication.properties file on startup and determine the 
correct index dir to use from that.  So (If my memory is correct) you 
can safely delete index so long as replication.properties points 
to the other directory.


I wasn't familiar with SOLR-1781.  Maybe replication is supposed to 
clean up the extra directories and doesn't sometimes?  In any case, 
I've found whenever it happens its ok to go out and delete the one(s) 
not being used, even if that means deleting index.


James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Artem Lokotosh [mailto:arco...@gmail.com]
Sent: Wednesday, January 18, 2012 12:24 PM
To: solr-user@lucene.apache.org
Subject: Re: replication, disk space

Which OS do you using?
Maybe related to this Solr bug
https://issues.apache.org/jira/browse/SOLR-1781

On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkindrochk...@jhu.edu  
wrote:
So Solr 1.4. I have a solr master/slave, where it actually doesn't 
poll for
replication, it only replicates irregularly when I issue a replicate 
command

to it.

After the last replication, the slave, in solr_home, has a data/index
directory as well as a data/index.20120113121302 directory.

The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index. file is actually the one currently 
being

used live, not the straight 'index'? Why?

I can't afford the disk space to leave both of these around 
indefinitely.
  After replication completes and is committed, why would two index 
dirs be
left?  And how can I restore this to one index dir, without 
downtime? If
it's really using the index.X directory, then I could just 
delete the

index directory, but that's a bad idea, because next time the server
starts it's going to be looking for index, not index..  And 
if it's
using the timestamped index file now, I can't delete THAT one now 
either.


If I was willing

replication, disk space

2012-01-18 Thread Jonathan Rochkind
So Solr 1.4. I have a solr master/slave, where it actually doesn't poll 
for replication, it only replicates irregularly when I issue a replicate 
command to it.


After the last replication, the slave, in solr_home, has a data/index 
directory as well as a data/index.20120113121302 directory.


The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index.<timestamp> directory is actually the one currently 
being used live, not the straight 'index'? Why?


I can't afford the disk space to leave both of these around 
indefinitely.  After replication completes and is committed, why would 
two index dirs be left?  And how can I restore this to one index dir, 
without downtime? If it's really using the index.X directory, then 
I could just delete the index directory, but that's a bad idea, 
because next time the server starts it's going to be looking for 
"index", not "index.<timestamp>".  And if it's using the timestamped index file 
now, I can't delete THAT one now either.


If I was willing to restart the tomcat container, then I could delete 
one, rename the other, etc. But I don't want downtime.


I really don't understand what's going on or how it got in this state. 
Any ideas?


Jonathan



replication failure, logs or notice?

2012-01-12 Thread Jonathan Rochkind
I think maybe my Solr 1.4 replications have been failing for quite some 
time, without me realizing it, possibly due to lack of disk space to 
replicate some large segments.


Where would I look to see if a replication failed? Just the standard 
solr log?  What would I look for?


There's no facility to have, like an email sent if replication fails or 
anything, is there?


I realize that Solr/java logging is something that still confuses me, 
I've done whatever was easiest, but I'm vaguely remembering now that by 
picking the right logging framework and configuring it properly, maybe 
you can send different types of events to different logs, like maybe 
replication events to their own log? Is this a thing?


Thanks for any ideas,

Jonathan




Re: changing omitNorms on an already built index

2011-11-07 Thread Jonathan Rochkind

On 10/27/2011 9:14 PM, Erick Erickson wrote:

Well, this could be explained if your fields are very short. Norms
are encoded into (part of?) a byte, so your ranking may be unaffected.

Try adding debugQuery=on and looking at the explanation. If you've
really omitted norms, I think you should see clauses like:

1.0 = fieldNorm(field=features, doc=1)
in the output, never something like


Thanks, this was very helpful. Indeed with debugQuery on, I get "1.0 = 
fieldNorm" on my index with omitNorms for the relevant field, and in my 
index without omitNorms for the relevant field, I get a non-unit value 
for fieldNorm. Thanks for giving me a way to reassure myself that 
omitNorms really is doing its thing.


Now to dive into my debugQuery and figure out why it doesn't seem to be 
having as much effect as I anticipated on relevance!





changing omitNorms on an already built index

2011-10-27 Thread Jonathan Rochkind
So Solr 1.4.  I decided I wanted to change a field to have 
omitNorms=true that didn't previously.


So I changed the schema to have omitNorms=true.  And I reindexed all 
documents.


But it seems to have had absolutely no effect. All relevancy rankings 
seem to be the same.


Now, I could have a mistake somewhere else, maybe I didn't do what I 
thought.


But I'm wondering if there are any known issues related to this, is 
there something special you have to do to change a field from 
omitNorms=false to omitNorms=true on an already built index?  Other than 
re-indexing everything? Any known issues relevant here?


Thanks for any help,

Jonathan


Re: Questions about LocalParams syntax

2011-09-20 Thread Jonathan Rochkind
I don't have the complete answer. But I _think_ if you do one 'bq' param 
with multiple space-separated directives, it will work.


And escaping is a pain.  But can be made somewhat less of a pain if you 
realize that single quotes can sometimes be used instead of 
double-quotes. What I do:


_query_:"{!dismax qf='title something else'}"

So by switching between single and double quotes, you can avoid need to 
escape. Sometimes you still do need to escape when a single or double 
quote is actually in a value (say in a 'q'), and I do use backslash 
there. If you had more levels of nesting though... I have no idea what 
you'd do.


I'm not even sure why you have the internal quotes here:

bq=\"format:\\\"Book\\\"^50\"


Shouldn't that just be bq='format:Book^50', what's the extra double 
quotes around Book?  If you don't need them, then with switching 
between single and double, this can become somewhat less crazy and error 
prone:


_query_:"{!dismax bq='format:Book^50'}"

I think. Maybe. If you really do need the double quotes in there, then I 
think switching between single and double you can use a single backslash 
there.
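
(So, untested, but putting the two suggestions together -- one bq with
space-separated clauses, single quotes inside, double quotes only around the
whole local-params string -- I'd expect something roughly like:

((_query_:"{!dismax qf='title^500 author^300 allfields' bq='format:Book^50 format:Journal^150'}test"))
)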



On 9/20/2011 9:39 AM, Demian Katz wrote:

I'm using the LocalParams syntax combined with the _query_ pseudo-field to 
build an advanced search screen (built on Solr 1.4.1's Dismax handler), but I'm 
running into some syntax questions that don't seem to be addressed by the wiki 
page here:

http://wiki.apache.org/solr/LocalParams


1.) How should I deal with repeating parameters?  If I use multiple boost 
queries, it seems that only the last one listed is used...  for example:

((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:Book^50\" 
bq=\"format:Journal^150\"}test"))

 boosts Journals, but not Books.  If I reverse the order of the 
two bq parameters, then Books get boosted instead of Journals.  I can work 
around this by creating one bq with the clauses OR'ed together, but I would 
rather be able to apply multiple bq's like I can elsewhere.


2.) What is the proper way to escape quotes?  Since there are multiple 
nested layers of double quotes, things get ugly and it's easy to end up with 
syntax errors.  I found that this syntax doesn't cause an error:


((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:\\\"Book\\\"^50\" 
bq=\"format:\\\"Journal\\\"^150\"}test"))

 ...but it also doesn't work correctly - the boost queries are 
completely ignored in this example.  Perhaps this is more a problem related to  
_query_ than to LocalParams syntax...  but either way, a solution would be 
great!

thanks,
Demian



Re: XML injection interface in select servlet?

2011-09-20 Thread Jonathan Rochkind

On Sep 20, 2011, at 04:33 , Jan Peter Stotz wrote:



I am now asking myself why would someone implement such a bloodcurdling
vulnerability into a web service? Until now I haven't found an exploit
using the parameters in a way an attacker would get an advantage. But the
way those parameters are implemented raise some doubts on my side if
security has been seriously taken into account while implementing Solr...


Solr committers can correct me if I'm wrong, but my impression is that 
the Solr API itself is generally _not_ intended to be exposed to the 
world. It's expected to be protected behind a firewall, accessed by 
trusted applications.


People periodically post to this list planning on exposing it to the 
world anyway; but my impression is there may be all kinds of security 
problems there, as well as DoS possibilities, etc.


So I think it may be safe to say that security has not been seriously 
taken into account -- if you mean security on a Solr instance which has 
its entire API exposed publicly to the world.  I don't think that's 
the intended use case.


Re: JSON indexing failing...

2011-09-19 Thread Jonathan Rochkind
So I'm not an expert in the Solr JSON update message, never used it 
before myself. It's documented here:


http://wiki.apache.org/solr/UpdateJSON

But Solr is not a structured data store like mongodb or something; you 
can send it an update command in JSON as a convenience, but don't let 
that make you think it can store arbitrarily nested structured data like 
mongodb or couchdb or something.


Solr has a single flat list of indexes, as well as stored fields which 
are also a single flat list per-document. You can format your update 
message as JSON in Solr 3.x, but you still can't tell it to do something 
it's incapable of. If a field is multi-valued, according to the 
documentation, the json value can be an array of values. But if the JSON 
value is a hash... there's nothing Solr can do with this, it's not how 
solr works.



It looks from the documentation that the value can sometimes be a hash 
when you're communicating other meta-data to Solr, like field boosts:


"my_boosted_field": {    /* use a map with boost/value for a
                            boosted field */
  "boost": 2.3,
  "value": "test"
},

But you can't just give it arbitrary JSON, you have to give it JSON of 
the sort it expects. Which does not include arbitrarily nested data hashes.
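
(For comparison, a valid update message -- field names made up -- keeps
everything flat, with a JSON array only for a multiValued field, e.g. POSTed
to /update/json:

{
  "add": {
    "doc": {
      "id": "doc1",
      "title": "some title",
      "subject": ["history", "europe"]
    }
  }
}
)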


Jonathan



Re: query for point in time

2011-09-15 Thread Jonathan Rochkind
You didn't tell us what your schema looks like, what fields with what 
types are involved.


But similar to how you'd do it in your database, you need to find 
'documents' that have a start date before your date in question, and an 
end date after your date in question, to find the ones whose range 
includes your date in question.


Something like this:

q=start_date:[* TO 2010-01-05T23:59:59Z] AND end_date:[2010-01-05T00:00:00Z TO *]

Of course, you need to add on your restriction to just documents about 
'John Smith', through another AND clause or an 'fq'.


But in general, if you've got a db with this info already, and this is 
all you need, why not just use the db?  Multi-hierarchy data like this 
is going to give you trouble in Solr eventually, you've got to arrange 
the solr indexes/schema to answer your questions, and eventually you're 
going to have two questions which require mutually incompatible schema 
to answer.


An rdbms is a great general purpose question answering tool for 
structured data.  lucene/Solr is a great indexing tool for text matching.


On 9/15/2011 2:55 PM, gary tam wrote:

Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for project,
started date, end date.

looks something like


John Smith   web site bug fix   2010-01-01   2010-01-03
             unit testing       2010-01-04   2010-01-06
             QA support         2010-01-07   2010-01-12
             implementation     2010-01-13   2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks



Re: query for point in time

2011-09-15 Thread Jonathan Rochkind

I think there's something wrong with your database then, but okay.

You still haven't said what your Solr schema looks like -- that list of 
values doesn't say what the solr field names or types are. I think this 
is maybe because you don't actually have a Solr database and have no 
idea how Solr works, you're just asking in theory? On the other hand, 
you just said you have better performance with solr -- I'm not sure how 
you were able to tell the performance of solr in answering these queries 
if you don't even know how to make them!


But, again, assuming your data is set up like i'm guessing it is, it's 
quite similar to what you'd do with an rdbms.


What does 'most current' mean? Can jobs be overlapping? To find the 
project with the latest start date for a given person, just limit to 
documents with that current person in a 'q' or 'fq', and then sort by 
start_date desc. Perhaps limit to 1 if you really only want one hit.  
Same principle as you would in an rdbms.
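
(Roughly, with made-up field names and before URL-encoding, something like:

q=person_name:"John Smith"&sort=start_date desc&rows=1
)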


Again, this requires setting up your solr index in such a way to answer 
these sorts of questions. Each document in Solr will represent a 
person-project pair.  It'll have fields for person (or multiple fields, 
personID, personFirst, personLast, etc), project name, project start 
date, project end date.  This will make it easy/possible to answer 
questions like your examples with Solr, but will make it hard to answer 
many other sorts of questions -- unlike an rdbms, it is difficult to set 
up a Solr index that can flexibly answer just about any question you 
throw at it, particularly when you have hierarchical or otherwise 
multi-entity data.


If you are interested, the standard Solr tutorial is pretty good: 
http://lucene.apache.org/solr/tutorial.html





On 9/15/2011 6:39 PM, gary tam wrote:

Thanks for the reply.  We had the search within the database initially, but
it proved to be too slow.  With solr we have much better performance.

One more question, how could I find the most current job for each employee

My data looks like


John Smith   department A   web site bug fix   2010-01-01   2010-01-03
                            unit testing       2010-01-04   2010-01-06
                            QA support         2010-01-07   2010-01-12
                            implementation     2010-01-13   2010-01-22

Jane Doe     department A   QA support         2010-01-01   2010-05-01
                            implementation     2010-05-02   2010-09-28

Joe Doe      department A   PHP development    2011-01-01   2011-08-31
                            Java Development   2011-09-01   2011-09-15

I would like to return this as my search result

John Smith   department A   implementation     2010-01-13   2010-01-22
Jane Doe     department A   implementation     2010-05-02   2010-09-28
Joe Doe      department A   Java Development   2011-09-01   2011-09-15


Thanks in advance
Gary



On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:


You didn't tell us what your schema looks like, what fields with what types
are involved.

But similar to how you'd do it in your database, you need to find
'documents' that have a start date before your date in question, and an end
date after your date in question, to find the ones whose range includes your
date in question.

Something like this:

q=start_date:[* TO 2010-01-05T23:59:59Z] AND end_date:[2010-01-05T00:00:00Z TO *]

Of course, you need to add on your restriction to just documents about
'John Smith', through another AND clause or an 'fq'.

But in general, if you've got a db with this info already, and this is all
you need, why not just use the db?  Multi-hierarchy data like this is going
to give you trouble in Solr eventually, you've got to arrange the solr
indexes/schema to answer your questions, and eventually you're going to have
two questions which require mutually incompatible schema to answer.

An rdbms is a great general purpose question answering tool for structured
data.  lucene/Solr is a great indexing tool for text matching.


On 9/15/2011 2:55 PM, gary tam wrote:


Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for
project,
started date, end date.

looks something like


John Smith   web site bug fix   2010-01-01   2010-01-03
             unit testing       2010-01-04   2010-01-06
             QA support         2010-01-07   2010-01-12
             implementation     2010-01-13   2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks




RE: need some guidance about how to configure a specific solr solution.

2011-08-12 Thread Jonathan Rochkind
I don't know anything about LifeRay (never heard of it), but it sounds like 
you've actually figured out what you need to know about LifeRay; all you've got 
left is: how to replicate the writer solr server content into the readers.

This should tell you how: 
http://wiki.apache.org/solr/SolrReplication

You'll need to find and edit the configuration files for the Solrs involved -- 
if you don't normally do that because LifeRay hides 'em from you, you'll need to 
find 'em. But it's a straightforward Solr feature (since 1.4), replication. 
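
Roughly (hostnames made up, the details are on that wiki page): the write 
server gets a "master" section in its solrconfig.xml and each read server 
gets a "slave" section, something like:

<!-- on the write (master) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- on each read (slave) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://write-server:8080/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>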

From: Roman, Pablo [pablo.ro...@uhn.ca]
Sent: Thursday, August 11, 2011 12:10 PM
To: solr-user@lucene.apache.org
Subject: need some guidance about how to configure a specific solr solution.

Hi There,

I am in IT and work on a project based on Liferay 605 with solr-3.2 as the 
indexer/search engine.

I presently have only one server that is indexing and searching, but reading the 
Liferay Support suggestions they point to the need of having:
- 2 to n SOLR read-servers for searching from any member of the Liferay cluster
- 1 SOLR write-server where all Liferay cluster members write.

However, going down to the detail of implementing that on the Liferay side, I 
think I know how to do it, which is editing these entries in the Solr plugin:

solr-spring.xml in the WEB-INF/classes/META-INF folder. Open this file in a 
text editor and you will see that there are two entries which define where the 
Solr server can be found by Liferay:

<bean id="indexSearcher" class="com.liferay.portal.search.solr.SolrIndexSearcherImpl">
  <property name="serverURL" value="http://localhost:8080/solr/select" />
</bean>
<bean id="indexWriter" class="com.liferay.portal.search.solr.SolrIndexWriterImpl">
  <property name="serverURL" value="http://localhost:8080/solr/update" />
</bean>

However, I don't know how to replicate the writer solr server content into the 
readers. Please can you provide advice about that?

Thanks,
Pablo



RE: paging size in SOLR

2011-08-10 Thread Jonathan Rochkind
I would imagine the performance penalties with deep paging will ALSO be there 
if you just ask for 10,000 rows all at once though, instead of in, say, 100-row 
paged batches. Yes? No?

-Original Message-
From: simon [mailto:mtnes...@gmail.com] 
Sent: Wednesday, August 10, 2011 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: paging size in SOLR

Worth remembering there are some performance penalties with deep
paging, if you use the page-by-page approach. may not be too much of a
problem if you really are only looking to retrieve 10K docs.

-Simon

On Wed, Aug 10, 2011 at 10:32 AM, Erick Erickson
erickerick...@gmail.com wrote:
 Well, if you really want to you can specify start=0 and rows=10000 and
 get them all back at once.

 You can do page-by-page by incrementing the start parameter as you
 indicated.

 You can keep from re-executing the search by setting your queryResultCache
 appropriately, but this affects all searches so might be an issue.

 Best
 Erick

 On Wed, Aug 10, 2011 at 9:09 AM, jame vaalet jamevaa...@gmail.com wrote:
 hi,
 i want to retrieve all the data from solr (say 10,000 ids ) and my page size
 is 1000 .
 how do i get back the data (pages) one after other ?do i have to increment
 the start value each time by the page size from 0 and do the iteration ?
 In this case am i querying the index 10 time instead of one or after first
 query the result will be cached somewhere for the subsequent pages ?


 JAME VAALET




Re: Remote backup of Solr index over low-bandwith connection

2011-08-09 Thread Jonathan Rochkind
You can use rsync to automatically only transfer the files that have 
changed. I don't think you'll have to home grow your own 'only transfer 
the diffs' solution, I think rsync will do that for you.


But yes, running an optimization, after many updates/deletes, will 
generally mean nearly everything has changed.


Solr's index, of course _is_ lucene, so your experience with lucene will 
be applicable to Solr.  Unless lucene or Solr have added new features 
since you last used it, but you're still using lucene, when you're using 
Solr.


On 8/9/2011 11:22 AM, Peter Kritikos wrote:

Hello, everyone,

My company will be using Solr on the server appliance we deliver to 
our clients. We would like to maintain remote backups of clients' 
search indexes to avoid rebuilding a large index when an appliance fails.


One of our clients backs up their data onto a remote server provided 
by a vendor which only provides storage space, so I don't believe it 
is possible for us to set up a remote slave server to use Solr's 
replication functionality. Because our client has a low-bandwidth 
connection to their backup server, we would like to minimize the 
amount of data transferred to the remote machine. Our Solr index 
receives commits every few minutes and will probably be optimized 
roughly once a day. Does our frequently modified index allow us to 
transfer an amount of data proportional to the number of new documents 
added to the search index daily? From my understanding, optimizing an 
index makes very significant changes to its files. Is there a way 
around this that I may be missing?


We have faced this problem in the past when our product used a 
Lucene-based search engine. We were unable to find a solution where we 
could only copy the diffs introduced to the index since the most 
recent backup, so we opted to make our indexing process faster. In 
addition to plain text, many of the documents that we are indexing are 
binary, e.g. Word, PDF. We cached the extracted text from these binary 
documents on the clients' backup servers, saving us the cost of 
extraction at index time. If we must pursue a solution like this for 
Solr, how else might we optimize the indexing process?


Much appreciated,
Peter Kritikos




RE: Multiple Cores on different machines?

2011-08-09 Thread Jonathan Rochkind
 tables. Others are suggesting 2 separate indexes on 2 different machines and
 using SOLRs capacity to combine cores and generate a third index that
 denormalizes the tables for us.

What capability is that, exactly?  I think you may be imagining it. 

Solr does have some capability to distribute a single logical index across 
several different servers (sharding) -- this feature is mainly intended for 
scaling/performance, when your index gets too big for one server.  

I am not quite sure why it's so popular for people to come to the list trying 
to use sharding (or a mythical 'capacity to combine cores' which isn't quite 
the same thing) for entirely other problems, but it usually leads to pain. 

What problem is it you are trying to solve by splitting things into separate 
indexes on two different machines, and then later generating a third index 
aggregating the two indexes?  

I suppose you _could_ do that, first index into two separate indexes, and then 
have an indexer which reads from both of those two indexes, and adds to a third 
index.  But it wouldn't be using any 'capacity to combine cores' -- and  I 
don't believe there is any such 'capacity to combine cores' in such a way to 
somehow automatically build a third index from two source indexes with an 
entirely different schema that somehow manages to 'denormalize' the two source 
indexes. 

What are you trying to accomplish that makes you imagine this?

Re: Weighted facet strings

2011-08-08 Thread Jonathan Rochkind
One kind of hacky way to accomplish some of those tasks involves 
creating a lot more Solr fields. (This kind of 'de-normalization' is 
often the answer to how to make Solr do something).


So facet fields are ordinarily not tokenized or normalized at all. But 
that doesn't work very well for matching query terms.  So if you want 
actual queries to match on these categories, you probably want an 
additional field that is tokenized/analyzed.  If you want to boost 
different category assignments differently, you probably want _multiple_ 
additional tokenized/analyzed fields.


So for instance, create separate analyzed fields for each category 
'weight', perhaps using the default 'text' analysis type.


category_text_weight_1
category_text_weight_2
etc
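
(For illustration only -- a rough sketch of what such declarations might look 
like in schema.xml; the field names and the 'text' type here are placeholders, 
not something from this thread:

<field name="category_text_weight_1" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="category_text_weight_2" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="category_text_weight_3" type="text" indexed="true" stored="false" multiValued="true"/>

At index time your application would put each category label into the field 
matching its weight.)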

Then use dismax to query, include all those category_text_* fields in 
the 'qf', and boost the higher weight ones more than the lower weight ones.


That will handle a number of your use cases, but not all of them.

Your first two cases are the most problematic:

filter: category=some_category_name, query: *:* - Results should be 
scored by the above mentioned weight 


So Solr doesn't really work like that. Normally a filter does not affect 
the scoring of the actual results _at all_. But if you change the query to:


fq=category:some_category
q=some_category
defType=dismax
qf=category_text_weight_1 category_text_weight_2^10 
category_text_weight_3^20


THEN, with the multiple analyzed category_text_weight_* fields, as 
described above, I think it should do what you want. You may have to 
play with exactly what boost to give to each field.


But your second use case is still tricky.

Solr doesn't really do exactly what you ask, but by using this method I 
think you can figure out hacky ways to accomplish it.  I'm not sure if 
it will solve all of your use cases, but maybe this will give you a 
start to figuring it out.



On 8/5/2011 6:55 AM, Michael Lorz wrote:

Hi all,

I have documents which are (manually) tagged with categories. Each
category-document relation has a weight between 1 and 5:

5: document fits perfectly in this category,
.
.
1: document may be considered as belonging to this category.


I would now like to use this information with solr. At the moment, I don't use
the weight at all:

<field name="category" type="string" indexed="true" stored="true"
multiValued="true"/>

Both the category as well as the document body are specified as query fields
(<str name="qf"> in solrconfig.xml).


What I would like is the following:

- filter: category=some_category_name, query: *:*  - Results should be scored by
the above mentioned weight
- filter: category=some_category_name, query: some_keyword - Results should be
scored by a combination of the score of 'some_keyword' and the above mentioned
weight
- filter: none, query: some_category_name - Documents with category
'some_category_name' should be found as well as documents which contain the term
'some_category_name'. Results should be scored by a combination of the score of
'some_keyword' and the above mentioned weight


Do you have any ideas how this could be done?

Thanks in advance
Michi


Re: Dispatching a query to multiple different cores

2011-08-08 Thread Jonathan Rochkind
However, if you unify your schemas to do this, I'd consider whether you 
really want separate cores/shards in the first place.


If you want to search over all of them together, what are your reasons 
to put them in separate solr indexes in the first place?  Ordinarily, if 
you want to search over them all together, the best place to start is 
putting them in the same solr index.


Then, the distribution/sharding feature is generally your next step, 
only if you have so many documents that you need to shard for 
performance reasons. That is the intended use case of the 
distribution/sharding feature.


On 8/8/2011 4:54 PM, Erik Hatcher wrote:

You could use Solr's distributed (shards parameter) capability to do this.  
However, if you've got somewhat different schemas that isn't necessarily going 
to work properly.  Perhaps unify your schemas in order to facilitate this using 
Solr's distributed search feature?

Erik

On Aug 3, 2011, at 05:22 , Ahmed Boubaker wrote:


Hello there!

I have a multicore solr with 6 different simple cores and somewhat
different schemas, and I defined another meta core which I would like to be a
dispatcher:  the requests are sent to the simple cores and the results are
aggregated before being sent back to the user.

Any ideas or hints on how I can achieve this?
I am wondering whether writing a custom SearchComponent or a custom
SearchHandler is a good entry point?
Is it possible to access other SolrCores which are in the same container as
the meta core?

Many thanks for your help.

Boubaker




Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Jonathan Rochkind

Dismax queries can. But

sort=termfreq(all_lists_text,'indie+music')

is not using dismax.  Apparently the termfreq function cannot? I am not familiar 
with the termfreq function.

To understand why you'd need to reindex, you might want to read up on how 
lucene actually works, to get a basic understanding of how different indexing 
choices affect what is possible at query time. Lucene In Action is a pretty 
good book.



On 8/8/2011 5:02 PM, Jason Toy wrote:

Aren't Dismax queries able to search for phrases using the default
index (which is what I am using)? If I can already do phrase searches, I
don't understand why I would need to reindex to be able to access phrases
from a function.

On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsmamarkus.jel...@openindex.iowrote:


Alexei, thank you, that does seem to work.

My sort results seem to be totally wrong though, I'm not sure if it's
because of my sort function or something else.

My query consists of:
sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100
And I get back 4571232 hits.

That's normal, you issue a catch all query. Sorting should work but..


All the results don't have the phrase indie music anywhere in their

data.

  Does termfreq not support phrases?

No, it is TERM frequency and indie music is not one term. I don't know how
this function parses your input, but it might not understand your + escape
and think it's one term consisting of exactly that.


If not, how can I sort specifically by termfreq of a phrase?

You cannot. What you can do is index multiple terms as one term using the
shingle filter. Take care, it can significantly increase your index size
and
number of unique terms.




On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko

ale...@superdownloads.com.br  wrote:

You can use the standard query parser and pass q=*:*

2011/8/8 Jason Toyjason...@gmail.com


I am trying to list some data based on a function I run,
specifically  termfreq(post_text,'indie music'), and I am unable to
do it without passing in data to the q parameter.  Is it possible to get
a sorted list without searching for any terms?

--

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533





Re: Can Solr with the StatsComponent analyze 20+ million files?

2011-08-08 Thread Jonathan Rochkind

On 8/8/2011 5:10 PM, Markus Jelsma wrote:

Will the StatsComponent in Solr do what we need with minimal configuration?
Can the StatsComponent only be used on a subset of the data? For
example, only look at data from certain months?

If I remember correctly, it cannot.


Well, if you index things properly, you could apply an fq to restrict to 
only certain months, and then use StatsComponent on top.
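
(For example, something along these lines -- assuming hypothetical field names 
'month' and 'duration', which are not from this thread:

q=*:*&fq=month:2011-07&stats=true&stats.field=duration&rows=0

would compute stats only over the documents from that month.)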


But I'd agree with others that Solr is probably not the best tool for 
this job. Solr's primary area of competency is text indexing and text 
search, not mathematical calculation. If you need a whole lot of text 
indexing and a little bit of math too, you might be able to get 
StatsComponent to do what you need, although you'll probably run into 
some tricky parts because this isn't really Solr's focus.


But if you need a whole bunch of math and no text indexing at all -- use 
a tool that has math rather than text search as its prime area of 
competency/focus, don't make things hard for yourself by using the wrong 
tool for the job.


(StatsComponent, incidentally, performs not-so-great on very large 
result sets, depending on what you ask it for).


Re: Indexing tweet and searching @keyword OR #keyword

2011-08-04 Thread Jonathan Rochkind
It's the WordDelimiterFilterFactory in your filter chain that's removing the 
punctuation entirely from your index, I think.


Read up on what the WordDelimiter filter does, and what its settings 
are; decide how you want things to be tokenized in your index to get the 
behavior you want; either get WordDelimiter to do it that way by 
passing it different arguments, or stop using WordDelimiter; come back 
with any questions after trying that!
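
(One possible direction, sketched here as an assumption rather than a tested 
config: recent WordDelimiterFilterFactory versions accept a types attribute 
pointing to a file that reassigns character types, so you could mark # and @ 
as ALPHA instead of delimiters, e.g.

filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt" ... 

with a wdfftypes.txt containing something like:

\u0023 => ALPHA
\u0040 => ALPHA

where \u0023 is # and \u0040 is @. Check the result in the analysis admin page 
before relying on it.)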



On 8/4/2011 11:22 AM, Mohammad Shariq wrote:

I have indexed around 1 million tweets (using the text dataType).
When I search the tweets with # OR @ I don't get the exact result.
e.g.  when I search for #ipad OR @ipad I get results where ipad is
mentioned, skipping the # and @.
Please suggest how to tune this, or which filter factories to use, to get the
desired result.
I am indexing the tweets as text; below is the text fieldType from my
schema.xml.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>



Re: Is there anyway to sort differently for facet values?

2011-08-04 Thread Jonathan Rochkind

No, it cannot. It just sorts alphabetically, actually by raw byte-order.

No other facet sorting functionality is available, and it would be 
tricky to implement in a performant way because of the way lucene 
works.  But it would certainly be useful to me too if someone could 
figure out a way to do it.


On 8/4/2011 2:43 PM, Way Cool wrote:

Thanks Eric for your reply. I am aware of facet.sort, but I haven't used it.
I will try that though.

Can it handle the values below in the correct order?
Under 10
10 - 20
20 - 30
Above 30

Or
Small
Medium
Large
XL
...

My second question is: if Solr can't do that for the values above by
using facet.sort, are there any other ways in Solr?

Thanks in advance,

YH

On Wed, Aug 3, 2011 at 8:35 PM, Erick Ericksonerickerick...@gmail.comwrote:


have you looked at the facet.sort parameter? The index value is what I
think you want.

Best
Erick
On Aug 3, 2011 7:03 PM, Way Coolway1.wayc...@gmail.com  wrote:

Hi, guys,

Is there anyway to sort differently for facet values? For example,

sometimes

I want to sort facet values by their values instead of # of docs, and I

want

to be able to have a predefined order for certain facets as well. Is that
possible in Solr we can do that?

Thanks,

YH


Re: What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?

2011-08-04 Thread Jonathan Rochkind
I'm not sure what you mean by index distribution, that could possibly 
mean several things.


But Solr has had a replication feature built into it since 1.4, which can 
probably handle the same use cases as rsync, but better.  So that may be 
what you want.
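
(A minimal sketch of that config, with placeholder host/port and conf file 
names -- see the wiki for the real details.

On the master, in solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On the slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
)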


There are certainly other experiments going on involving various kinds 
of scaling/distribution, including the sharding feature, that I'm not 
very familiar with. I don't know if anyone's tried to do anything with 
hadoop.




On 8/4/2011 2:52 PM, Way Cool wrote:

Hi, guys,

What's the best way (practice) to do index distribution at this moment?
Hadoop? or rsyncd (back to 3 years ago ;-)) ?

Thanks,

Yugang



Re: lucene/solr, raw indexing/searching

2011-08-04 Thread Jonathan Rochkind

It depends. Okay, the source contains 4 harv. l. rev. 45 .

Do you want a user entered harv. to ALSO match harv (without the 
period) in source, and vice versa? Or do you require it NOT match? Or do 
you not care?


The default filter analysis chain will index 4 harv. l. rev. 45 
essentially as 4;harv;l;rev;45.  A phrase search for
4 harv. l. rev. 45 will match it, but so will a phrase search for 4 
harv l rev 45 , and in fact so will a phrase search for 4 harv. l. rev45


That could be good, or it could be bad.

The point of the Solr analysis chain is to apply tokenization and 
transformation at both index time and query time, so queries will match 
source in the way you want. You can customize this analysis chain 
however you want, in extreme cases even writing your own analyzers in 
Java. If the out of the box default isn't doing what you want, you'll 
have to spend some time thinking about how an inverted index like lucene 
works, and what you want. You would need to provide a lot more 
specifications/details for someone else to figure out what analysis 
chain will do what you want, but I bet you can figure it out yourself 
after reading up a bit and thinking a bit.


See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 On 8/4/2011 4:30 PM, dhastings wrote:

I have decided to use solr for indexing as well.

the types of searches I'm doing are professional/academic.
so for example, I need to match
all of the following exactly from my source data:
 3.56,
  4 harv. l. rev. 45,
  187-532,
 3 llm 56,
  5 unts 8,
 6 u.n.t.s. 78,
 father's obligation


i seem to keep running into issues getting this to work.  the searching is
being done on a text field that is not stored.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3226611.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Dismax mm per field

2011-08-03 Thread Jonathan Rochkind
There is not, and the way dismax works makes it not really that feasible 
in theory, sadly.


One thing you could do instead is combine multiple separate dismax 
queries using the nested query syntax. This will affect your relevancy 
ranking, possibly in odd ways, but anything that accomplishes 'mm per 
field' will necessarily not really be using dismax's disjunction-max 
relevancy ranking in the way it's intended.


Here's how you could combine two separate dismax queries:

defType=lucene
q=_query_:"{!dismax qf=field1 mm=100%}blah blah" AND _query_:"{!dismax 
qf=field2 mm=80%}foo bar"


That whole q value would need to be properly URI escaped, which I 
haven't done here for human-readability.


Dismax has always got an mm; there's no way to not have an mm with 
dismax, but mm=100% might be what you mean. Of course, one of those 
queries could also not be dismax at all, but the ordinary lucene query 
parser or anything else. And of course you could use the same query 
text for both nested queries, repeating e.g. blah blah in both.




On 8/3/2011 11:24 AM, Dmitriy Shvadskiy wrote:

Hello,
Is there a way to apply (e)dismax mm parameter per field? If I have a query
field1:(blah blah) AND field2:(foo bar)

is there a way to apply mm only to field2?

Thanks,
Dmitriy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-tp3222594p3222594.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Jonathan Rochkind
There's no great way to do this. I understand your problem as: It's a 
multi-valued field, but you want to sort on whichever of those values 
matched the query, not on the values that didn't. (Not entirely clear 
what to do if the documents are in the result set because of a match in 
an entirely different field!)


I would sometimes like to do that too, and haven't really been able to 
come up with any great way to do it.


Something involving facetting kind of gets you closer, but ends up being 
a huge pain and doesn't get  you (or at least me) all the way to 
supporting the interface I'd really want.


On 8/3/2011 10:39 AM, Olson, Ron wrote:

Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron

DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is  unauthorized and strictly prohibited.  If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.



Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Jonathan Rochkind
Not so much that it's a corner case in the sense of being unusual, 
necessarily (I'm not sure); it's just something that fundamentally 
doesn't fit well into lucene's architecture.


I'm not sure that filing a JIRA will be much use. It's really unclear 
how one would get lucene to do this, it would be significant work to do, 
and it's unlikely any Solr developer is going to decide to spend 
significant time on it unless they need it for their own clients.


On 8/3/2011 11:40 AM, Olson, Ron wrote:

*Sigh*...I had thought maybe reversing it would work, but that would require 
creating a whole new index, on a separate core, as the existing index is used 
for other purposes. Plus, given the volume of data, that would be a big deal, 
update-wise. What would be better would be to remove that particular sort 
option-button on the webpage. ;)

I'll create a Jira issue, but in the meanwhile I'll have to come up with something else. 
I guess I didn't realize how much of a corner case this problem is. :)

Thanks for the suggestions!

Ron

-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org]
Sent: Wednesday, August 03, 2011 10:26 AM
To: solr-user@lucene.apache.org
Subject: Re: Strategies for sorting by array, when you can't sort by array?

Hi Ron.
This is an interesting problem you have. One idea would be to create an index 
with the entity relationship going in the other direction.  So instead of one 
to many, go many to one.  You would end up with multiple documents with varying 
names but repeated parent entity information -- perhaps simply using just an ID 
which is used as a lookup. Do a search on this name field, sorting by a 
non-tokenized variant of the name field. Use Result-Grouping to consolidate 
multiple matches of a name to the same parent document. This whole idea might 
very well be academic since duplicating all the parent entity information for 
searching on that too might be a bit much than you care to bother with. And I 
don't think Solr 4's join feature addresses this use case. In the end, I think 
Solr could be modified to support this, with some work. It would make a good 
feature request in JIRA.

~ David Smiley

On Aug 3, 2011, at 10:39 AM, Olson, Ron wrote:


Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron

DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is  unauthorized and strictly prohibited.  If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.






Re: Setting up Namespaces to Avoid Running Multiple Solr Instances

2011-08-03 Thread Jonathan Rochkind
I think that Solr multi-core (nothing to do with CPU cores, just what 
it's called in Solr) is what you're looking for. 
http://wiki.apache.org/solr/CoreAdmin
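
(As a rough sketch -- the core names and paths here are made up: a single Solr 
instance can serve several cores defined in solr.xml, e.g.

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="site1" instanceDir="site1"/>
    <core name="site2" instanceDir="site2"/>
  </cores>
</solr>

Each core then has its own schema, config and index, and its own URL path such 
as http://localhost:8983/solr/site1/select, all on one port.)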


On 8/3/2011 2:25 PM, Mike Papper wrote:

Hi, we run several independent websites on the same machines. Each site uses
a similar codebase for search. Currently each site contacts its own solr
server on a slightly different port. This means of course that we are
running several solr servers (each on their own port) on the same machine. I
would like to make this simpler by running just one server, listening on one
port. Can we do this and at the same time have the indexes and search data
separated for each web site?

So, I'm asking if I can namespace or federate the solr server. But by doing
so I would like to have the indexes etc. not comingled within the server.

I'm new to solr so there might be a hiccup from the fact that currently each
solr server points to its own directory on a site-specific path (something
like /apps/site/solr/*) which contains the solr plugin (we're using Ruby on
Rails). Can this be set up as a namespace (one for each web site) within the
single server instance?

Mike



Re: lucene/solr, raw indexing/searching

2011-08-02 Thread Jonathan Rochkind
In your solr schema.xml, are the fields you are using defined as text 
fields with analyzers? It sounds like you want no analysis at all, which 
probably means you don't want text fields either, you just want string 
fields. That will make it impossible to search for individual tokens 
though, searches will match only on complete matches of the value.


I'm not quite sure how to do what you want, it depends on exactly what 
you want. What kind of searching do you expect to support?  If you still 
do want tokenization, you'll still want some analysis... but I'm not 
quite sure how that corresponds to what you'd want to do on the lucene 
end.  What you're trying to do is going to be inevitably confusing, I 
think. Which doesn't mean it's not possible.  You might find it less 
confusing if you were willing to use Solr to index though, rather than 
straight lucene -- you could use Solr via the SolrJ java classes, rather 
than the HTTP interface.


On 8/2/2011 11:14 AM, dhastings wrote:

Hello,
I am trying to get lucene and solr to agree on a completely Raw indexing
method.  I use lucene in my indexers that write to an index on disk, and
solr to search those indexes that i create, as creating the indexes without
solr is much much faster than using the solr server.

are there settings for BOTH solr and lucene to use EXACTLY what's in the
content as opposed to interpreting what it thinks I'm trying to do?  My
content is extremely specific and needs no interpretation or adjustment,
indexing or searching, a text field.

for example:

203.1 seems to be indexed as 2031.  searching for 203.1 i can get to work
correctly, but then it won't find what's indexed using 3.1's standard
analyzer.

if i have content that is :
this is rev. 23.302

i need it indexed EXACTLY as it appears,
this is rev. 23.302

I do not want any of solr or lucenes attempts to fix my content or my
queries.  rev. needs to stay rev. and not turn into rev, 23.302
needs to stay as such, and NOT turn into 23302.  this is for BOTH indexing
and searching.

any hints?

right now for indexing i have:

 Set nostopwords = new HashSet(); nostopwords.add("buahahahahahaha");

Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
writer  = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED);
writer.setUseCompoundFile(false) ;


and for searching i have in my schema :


  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>


Thanks.  Very much appreciated.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Jetty error message regarding EnvEntry in WebAppContext

2011-08-02 Thread Jonathan Rochkind

On 8/2/2011 11:42 AM, Marian Steinbach wrote:

Can anyone tell me how a working configuration for Jetty 6.1.22 would have
to look like?


You know that Solr distro comes with a jetty with a Solr in it, right, 
as an example application? Even if you don't want to use it for some 
reason, that would probably be the best model to look at for a working 
jetty with solr.


Or is the problem that you want a different version of jetty?

As it happens, I just recently set up a jetty 6.1.26 for another 
project, not for solr. It was kind of a pain not being too familiar with 
java deployment or jetty.  But I did get JNDI working, by following the 
jetty instructions here: http://docs.codehaus.org/display/JETTY/JNDI  
(It was a bit confusing to figure out what they were talking about not 
being familiar with jetty, but eventually I got it, and the instructions 
were correct.)


But if I wanted to run Solr in jetty, I'd start with the jetty that is 
distributed with solr, rather than trying to build my own.


Re: performance crossover between single index and sharding

2011-08-02 Thread Jonathan Rochkind
What's the reasoning  behind having three shards on one machine, instead 
of just combining those into one shard? Just curious.  I had been 
thinking the point of shards was to get them on different machines, and 
there'd be no reason to have multiple shards on one machine.


On 8/2/2011 1:59 PM, Burton-West, Tom wrote:

Hi Markus,

Just as a data point for a very large sharded index, we have the full text of 
9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 
machines. Each machine has 3 shards. The size of each shard ranges between 
475GB and 550GB.  We are definitely I/O bound. Our machines have 144GB of 
memory with about 16GB dedicated to the tomcat instance running the 3 Solr 
instances, which leaves about 120 GB (or 40GB per shard) for the OS disk cache. 
 We release a new index every morning and then warm the caches with several 
thousand queries.  I probably should add that our disk storage is a very high 
performance Isilon appliance that has over 500 drives and every block of every 
file is striped over no less than 14 different drives. (See blog for details *)

We have a very low number of queries per second (0.3-2 qps) and our modest 
response time goal is to keep 99th percentile response time for our application 
(i.e. Solr + application) under 10 seconds.

Our current performance statistics are:

average response time   300 ms
median response time    113 ms
90th percentile         663 ms
95th percentile         1,691 ms

We had plans to do some performance testing to determine the optimum shard size 
and optimum number of shards per machine, but that has remained on the back 
burner for a long time as other higher priority items keep pushing it down on 
the todo list.

We would be really interested to hear about the experiences of people who have 
so many shards that the overhead of distributing the queries, and 
consolidating/merging the responses becomes a serious issue.


Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

* 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, August 02, 2011 12:33 PM
To: solr-user@lucene.apache.org
Subject: Re: performance crossover between single index and sharding

Actually, i do worry about it. Would be marvelous if someone could provide
some metrics for an index of many terabytes.


[..] At some extreme point there will be diminishing
returns and a performance decrease, but I wouldn't worry about that at all
until you've got many terabytes -- I don't know how many but don't worry
about it.

~ David

-
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context:
http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in
dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing
list archive at Nabble.com.


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind
Any changes you make related to stemming or normalization are likely 
going to require a re-index, just how it goes, just how solr/lucene 
works.  What you can do just by normalizing at query time is limited, 
almost any good solution to this type of problem is going to require 
normalization at index time.


If you're going to be fiddling with a production solr, it pays to figure 
out a workflow such that you can introduce indexing changes without 
downtime, this is not the last time you'll have to do it.


On 8/1/2011 12:35 PM, thomas wrote:

Thanks Alexei,
Thanks Paul,

I played with the solr.PhoneticFilterFactory. Analysing my query in the solr
admin backend showed me how it works and that it is working. My major problem is
that this filter needs to be applied to the index chain as well as to the
query chain to generate matches for our search. We have a huge index at this
point and I'm not really happy about reindexing all the content.

Is there maybe a more subtle solution which works by manipulating
the query chain only?

Otherwise I need to back up the whole index and try to reindex overnight when
CMS users are sleeping.

I will have a look into the ColognePhonetic encoder. I'm just afraid I'll have
to reindex the whole content there as well.

Thomas

--
View this message in context: 
http://lucene.472066.n3.nabble.com/German-language-specific-problem-automatic-Spelling-correction-automatic-Synonyms-tp3216278p3216414.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind

On 8/1/2011 12:42 PM, Paul Libbrecht wrote:
Otherwise i need to backup the whole index and try to reindex 
overnight when

cms users are sleeping.

With some work you can do this using an extra solr that just pulls everything, 
then swaps the indexes (that needs a bit of downtime), then re-indexes the 
things changed during the night.
I feel this should be a standard feature of SOLR...



It sort of is, in the sense that you can do it with replication, with no 
downtime. (Although you'll need enough disk and RAM in the slave to warm 
the replicated index while still serving queries from the older index, 
for no downtime).


Reindex to a separate solr (or separate core), then have the actual 
production core set up as a slave, and have it replicate from master 
when the re-indexing is done.  You can have your relevant conf files 
(schema or solrconfig) set up to replicate too, so you get those new 
ones in production exactly when you get the new indexes they go with.


The replication feature isn't exactly set up for this, so it gets a bit 
confusing. I set up the 'slave' with NO polling.  It still needs to be 
set up with config saying it's a slave, though. And it still needs to 
have a 'master' URL in there, even though you can also supply/override 
the master URL with a manual replicate command; if there's no master URL 
at all, Solr will refuse to start up.   So I configure the master URL, but 
without any polling for changes. Then I manually issue an HTTP replicate 
command to the slave only when I have a rebuilt index on the master that I 
want to swap in. It seems to be working.
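
(The manual command is just an HTTP request to the slave's replication handler, 
something like the following -- host and port are placeholders:

http://slave-host:8983/solr/replication?command=fetchindex

You can also pass a masterUrl parameter on that request to override the 
configured master.)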


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind

On 8/1/2011 1:40 PM, Mike Sokolov wrote:
If you want to avoid re-indexing, you could consider building a 
synonym file that is generated using your rule set, and then using 
that to expand your queries.  You'd need to get a list of all terms in 
your index and then process them to generate synyonyms.  Actually, I 
don't know how to get a list of all the terms without Java 
programming, though: is there a way?


The terms component will give you a list of all terms, I think. 
http://wiki.apache.org/solr/TermsComponent


But this is getting awfully hacky and hard to maintain simply to avoid 
doing a re-index. I still think doing a re-index is a normal part of 
evolving your Solr configuration, and better to just get used to it (and 
figure out how to do it in production with no or minimal downtime) now.




Re: colocated term stats

2011-07-28 Thread Jonathan Rochkind

Not sure if this will do what you want, but one way might be using facets.

Take the term you are interested in, and apply it as an fq.  Now the 
result set will include only documents that include that term.  So also 
request facets for that result set; the top 10 facet values are the top 10 
terms that appear in that result set -- which are the top 10 terms that 
appear in documents together with your fq constraint. (Okay, you might 
need to look at 11, because one of the facet values will be the same 
term you fq constrained on.) You don't need to look at actual documents at 
all (rows=0), just the facet response.
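
(A sketch of the kind of request I mean, with a made-up field name 'text' and 
term 'solr':

q=*:*&fq=text:solr&facet=true&facet.field=text&facet.limit=11&facet.mincount=1&rows=0
)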


Make sense? Does that do what you want?

On 7/27/2011 9:12 PM, Twomey, David wrote:

Given a query term, is it possible to get from the index the top 10 collocated 
terms in the index.

ie:  return the top 10 terms that appear with this term based on doc count.

A plus would be to add some constraints on how near the terms are in the docs.






Re: Exact match not the first result returned

2011-07-28 Thread Jonathan Rochkind
Keep in mind that if you use a field type that includes spaces (eg 
StrField, or KeywordTokenizer), then if you're using dismax or lucene 
query parsers, the only way to find matches in this field on queries 
that include spaces will be to do explicit phrase searches with double 
quotes.


These fields will, however, work fine with pf in dismax/edismax as per 
Hoss's example.


But yeah, I do what Hoss recommends -- I've got a KeywordTokenizer copy 
of my searchable field. I use a pf on that field with a very high boost 
to try and boost truly complete matches, that match the entirety of 
the value.  It's not exactly 'exact', I still do some normalization, 
including flattening unicode to ascii, and normalizing one or more 
space or punctuation characters to exactly one space using a char regex filter.


It seems to pretty much work -- this is just one of various relevancy 
tweaks I've got going on, to the extent that my relevancy has become 
pretty complicated and hard to predict and doesn't always do what I'd 
expect/intend, but this particular aspect seems to mostly pretty much work.


On 7/27/2011 10:55 PM, Chris Hostetter wrote:

: With your solution, RECORD 1 does appear at the top but I think thats just
: blind luck more than anything else because RECORD 3 shows as having the same
: score. So what more can I do to push RECORD 1 up to the top. Ideally, I'd
: like all three records returned with RECORD 1 being the first listing.

with omitNorms RECORD1 and RECORD3 have the same score because only the
tf() matters, and both docs contain the term frank exactly twice.

the reason RECORD1 isn't scoring higher even though it contains (as you
put it matchings 'Fred' exactly is that from a term perspective, RECORD1
doesn't actually match myname:Fred exactly, because there are in fact
other terms in that field because it's multivalued.

one way to indicate that you *only* want documents where entire field
values match your input (ie: RECORD1 but no other records) would be to
use a StrField instead of a TextField, or an analyzer that doesn't split up
tokens (ie: something using KeywordTokenizer).  that way a query on
myname:Frank would not match a document where you had indexed the value
Frank Stalone, but a query for myname:Frank Stalone would.

in your case, you don't want *only* the exact field value matches, but you
want them boosted, so you could do something like copyField myname into
myname_str and then do...

   q=+myname:Frank myname_str:Frank^100

...in which case a match on myname is required, but a match on
myname_str will greatly increase the score.

dismax (and edismax) are really designed for situations like this...

   defType=dismax  qf=myname  pf=myname_str^100  q=Frank



-Hoss



Re: Possible to use quotes in dismax qf?

2011-07-28 Thread Jonathan Rochkind
It's not clear to me why you would try to do that, I'm not sure it makes 
a lot of sense.


You want to find all documents that have sail boat as a phrase AND 
have sail somewhere in them AND have boat somewhere in them?  That's 
exactly the same as just all documents that have sail boat as a phrase 
-- such documents will neccesarily include sail and boat, right?  So 
why not just ask for q=sail boat?


What are you actually trying to do?

Maybe dismax 'pf', which relevancy-boosts documents which have your 
input as a phrase, is what you really want?  Then you'd just search for 
q=sail boat, but documents that included sail boat as a phrase 
would be boosted, at the boost you specify.
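
(Something like this, where the pf fields and all the boost values are only 
placeholders:

defType=dismax
q=sail boat
qf=title^10 content^2
pf=title^20 content^4

The pf clause only affects scoring; it doesn't change which documents match.)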


On 7/28/2011 10:00 AM, O. Klein wrote:

I want to do a dismax search that searches for the original query and this query as a
phrase query:

q=sail boat needs to be converted to dismax query q=sail boat "sail boat"

qf=title^10 content^2

What is best way to do this?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Possible-to-use-quotes-in-dismax-qf-tp3206762p3206762.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Index

2011-07-28 Thread Jonathan Rochkind
I have no idea what you mean. A file on your disk? What does INDEX in 
solr mean?   Be more specific and clear, perhaps provide an example,  
and maybe someone can help you.


On 7/28/2011 5:45 PM, GAURAV PAREEK wrote:

Hi All,

How can we check that a particular file is not indexed in solr?

Regards,
Gaurav



Re: An idea for an intersection type of filter query

2011-07-27 Thread Jonathan Rochkind
I don't know the answer to feasibilty either, but I'll just point out 
that boolean OR corresponds to set union, not set intersection.  
So I think you probably mean a 'union' type of filter query; 
'intersection' does not seem to describe what you are describing; 
ordinary 'fq' values are 'intersected' already to restrict the result 
set, no?


So, anyhow, the basic goal, if I understand it right, is not to provide 
any additional semantics, but to allow individual clauses in an 'fq' 
OR to be cached and looked up in the filter cache individually.


Perhaps someone (not me) who understands the Solr architecture better 
might also have another suggestion for how to get to that goal, other 
than the specific thing you suggested. I do not know, sorry.


Hmm, but I start thinking, what about a general purpose mechanism to 
identify a sub-clause that should be fetched/retrieved from the filter 
cache. I don't _think_ current nested queries will do that:


fq=_query_:"foo:bar" OR _query_:"foo:baz"

That's legal now (and doesn't accomplish much) -- but what if the 
individual subquery components could consult the filter cache 
seperately?  I don't know if nested query is the right way to do that or 
not, but I'm thinking some mechanism where you could arbitrarily 
identify clauses that should be filter cached independently?


Jonathan

On 7/27/2011 4:00 PM, Shawn Heisey wrote:
I've been looking at the slow queries our Solr installation is 
receiving.  They are dominated by queries with a simple q parameter 
(often *:* for all docs) and a VERY complicated fq parameter.  The 
filter query is built by going through a set of rules for the user and 
putting together each rule's query clause separated by OR -- we can't 
easily break it into multiple filters.


In addition to causing queries themselves to run slowly, this causes 
large autowarm times for our filterCache -- my filterCache 
autowarmCount is tiny (4), but it sometimes takes 30 seconds to warm.


I've seen a number of requests here for the ability to have multiple 
fq parameters ORed together.  This is probably possible, but in the 
interests of compatibility between versions, very impractical.  What 
if a new parameter was introduced?  It could be named fqi, for filter 
query intersection.  To figure out the final bitset for multiple fq 
and fqi parameters, it would use this kind of logic:


fq AND fq AND fq AND (fqi OR fqi OR fqi)

This would let us break our filters into manageable pieces that can 
efficiently populate the filterCache, and they would autowarm quickly.


Is the filter design in Solr separated cleanly enough to make this at 
all reasonable?  I'm not a Java developer, so I'd have a tough time 
implementing it myself.  When I have a free moment I will take a look 
at the code anyway.  I'm trying to teach myself Java.


Thanks,
Shawn




Re: Speeding up search by combining common sub-filters

2011-07-27 Thread Jonathan Rochkind
I'm pretty sure Solr/lucene has no such optimization already, but 
it's not clear to me that it would result in much of a performance 
benefit. Just because of the way lucene works, it's not obvious to me 
that the second version of your query would be noticeably faster than the 
first version.


Maybe in cases with many many clauses, rather than the few clauses in 
your example. You'd definitely want to performance test it to verify 
there are any gains, before embarking on writing the 'optimization' -- 
you can test it just by sending the different versions of your real 
world queries to Solr and seeing what the response times are, 
calculating the hypothetically 'optimized' version yourself by hand if 
need be, right?




On 7/27/2011 5:05 PM, Scott Smith wrote:

We have a solr application which ends up creating queries with very complicated 
filters (literally hundreds and sometimes thousands of terms -- typically a large 
number of terms OR'ed together where each of these terms might have a half a 
dozen keywords ANDed/ORed together).  In looking at the filters, I realized 
that there are often a lot of common sub-filters.

A simple example of what I mean is:

 (cat AND dog) OR (cat AND horse)

This could clearly be simplified by saying:

 cat AND (dog OR horse)

It turns out that finding and combining common sub-filters isn't trivial for our 
application.  So, before I start a project to attempt some kind of 
optimization, my question is whether it's likely that I will see significant 
decreases in query times to justify the development effort it takes to optimize the 
filters.  Certainly, if I thought I might get a 20%+ decrease in time, I would say it's 
probably a good project.  If it's just a few percentage points of improvement, then I'm 
less excited about doing it.

Does Solr already go through some kind of optimization which effectively 
combines common sub-filters and possibly duplicated terms?  Does anyone have 
any thoughts on this subject?

Thanks

Scott



slave data files way bigger than master

2011-07-26 Thread Jonathan Rochkind

So I've got Solr 1.4.  I've got replication going on.

Once a day, before replication, I optimize on master.  Then I replicate.

I'd expect optimization before replication would basically replace all 
files on the slave; this is expected.


But that means I'd also expect that the index files on the slave would be 
identical to, and the same size as, those on the master after replication -- 
that is the point of replication, yes?


But they are not. The master is only 12G, the slave is 39G.  The index 
files in slave and master have completely different filenames too, I 
don't know if that's expected, but it's not what I expected.  I'll post 
complete file lists below.


Anyone have any idea what's going on?  Also... I wonder if these extra 
index files on the slave are just extra, not even looked at by the slave 
solr, or if instead they actually ARE included in the indexes!  If the 
latter, and we have 'ghost' documents in the index, that could explain 
some weird problems I'm having with the slave getting Java out of heap 
space errors due to huge uninverted indexes, even though the index is 
basically the same with the same solrconfig.xml settings as it has been 
for a while, without such problems.


Greatly appreciate if anyone has any ideas.


MASTER: ls -lh master_index

total 12G
-rw-rw-r-- 1 tomcat tomcat  3.0G Jul 26 06:37 _24p.fdt
-rw-rw-r-- 1 tomcat tomcat   15M Jul 26 06:37 _24p.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 06:33 _24p.fnm
-rw-rw-r-- 1 tomcat tomcat  1.2G Jul 26 06:44 _24p.frq
-rw-rw-r-- 1 tomcat tomcat   49M Jul 26 06:44 _24p.nrm
-rw-rw-r-- 1 tomcat tomcat  1.1G Jul 26 06:44 _24p.prx
-rw-rw-r-- 1 tomcat tomcat  7.8M Jul 26 06:44 _24p.tii
-rw-rw-r-- 1 tomcat tomcat  660M Jul 26 06:44 _24p.tis
-rw-rw-r-- 1 tomcat tomcat  2.1G Jul 26 08:54 _2k4.fdt
-rw-rw-r-- 1 tomcat tomcat  7.6M Jul 26 08:54 _2k4.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 08:51 _2k4.fnm
-rw-rw-r-- 1 tomcat tomcat  719M Jul 26 08:59 _2k4.frq
-rw-rw-r-- 1 tomcat tomcat   25M Jul 26 08:59 _2k4.nrm
-rw-rw-r-- 1 tomcat tomcat  797M Jul 26 08:59 _2k4.prx
-rw-rw-r-- 1 tomcat tomcat  5.0M Jul 26 08:59 _2k4.tii
-rw-rw-r-- 1 tomcat tomcat  436M Jul 26 08:59 _2k4.tis
-rw-rw-r-- 1 tomcat tomcat  211M Jul 26 09:25 _2n3.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 09:25 _2n3.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 09:25 _2n3.fnm
-rw-rw-r-- 1 tomcat tomcat   72M Jul 26 09:26 _2n3.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 09:26 _2n3.nrm
-rw-rw-r-- 1 tomcat tomcat   78M Jul 26 09:26 _2n3.prx
-rw-rw-r-- 1 tomcat tomcat  668K Jul 26 09:26 _2n3.tii
-rw-rw-r-- 1 tomcat tomcat   53M Jul 26 09:26 _2n3.tis
-rw-rw-r-- 1 tomcat tomcat  186M Jul 26 09:49 _2q6.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 09:49 _2q6.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 09:49 _2q6.fnm
-rw-rw-r-- 1 tomcat tomcat   60M Jul 26 09:50 _2q6.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 09:50 _2q6.nrm
-rw-rw-r-- 1 tomcat tomcat   64M Jul 26 09:50 _2q6.prx
-rw-rw-r-- 1 tomcat tomcat  562K Jul 26 09:50 _2q6.tii
-rw-rw-r-- 1 tomcat tomcat   45M Jul 26 09:50 _2q6.tis
-rw-rw-r-- 1 tomcat tomcat  246M Jul 26 10:16 _2t9.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 10:16 _2t9.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 10:16 _2t9.fnm
-rw-rw-r-- 1 tomcat tomcat   68M Jul 26 10:17 _2t9.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 10:17 _2t9.nrm
-rw-rw-r-- 1 tomcat tomcat   89M Jul 26 10:17 _2t9.prx
-rw-rw-r-- 1 tomcat tomcat  602K Jul 26 10:17 _2t9.tii
-rw-rw-r-- 1 tomcat tomcat   53M Jul 26 10:17 _2t9.tis
-rw-rw-r-- 1 tomcat tomcat  221M Jul 26 10:45 _2wc.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 10:45 _2wc.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 10:45 _2wc.fnm
-rw-rw-r-- 1 tomcat tomcat   69M Jul 26 10:46 _2wc.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 10:46 _2wc.nrm
-rw-rw-r-- 1 tomcat tomcat   82M Jul 26 10:46 _2wc.prx
-rw-rw-r-- 1 tomcat tomcat  613K Jul 26 10:46 _2wc.tii
-rw-rw-r-- 1 tomcat tomcat   53M Jul 26 10:46 _2wc.tis
-rw-rw-r-- 1 tomcat tomcat   75M Jul 26 11:14 _2y6.fdt
-rw-rw-r-- 1 tomcat tomcat  315K Jul 26 11:14 _2y6.fdx
-rw-rw-r-- 1 tomcat tomcat   11M Jul 26 11:15 _2ze.fdt
-rw-rw-r-- 1 tomcat tomcat   42K Jul 26 11:15 _2ze.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 11:14 _2ze.fnm
-rw-rw-r-- 1 tomcat tomcat  157K Jul 26 11:14 _2ze.frq
-rw-rw-r-- 1 tomcat tomcat  6.9K Jul 26 11:14 _2ze.nrm
-rw-rw-r-- 1 tomcat tomcat  201K Jul 26 11:14 _2ze.prx
-rw-rw-r-- 1 tomcat tomcat  3.8K Jul 26 11:14 _2ze.tii
-rw-rw-r-- 1 tomcat tomcat  293K Jul 26 11:14 _2ze.tis
-rw-rw-r-- 1 tomcat tomcat  224M Jul 26 11:14 _2zf.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 11:14 _2zf.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 11:14 _2zf.fnm
-rw-rw-r-- 1 tomcat tomcat   79M Jul 26 11:15 _2zf.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 11:15 _2zf.nrm
-rw-rw-r-- 1 tomcat tomcat   88M Jul 26 11:15 _2zf.prx
-rw-rw-r-- 1 tomcat tomcat  869K Jul 26 11:15 _2zf.tii
-rw-rw-r-- 1 tomcat tomcat   76M Jul 26 11:15 _2zf.tis
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 11:14 _2zg.fnm
-rw-rw-r-- 1 tomcat tomcat  

Re: commit time and lock

2011-07-25 Thread Jonathan Rochkind

Thanks, this is helpful.

I do indeed periodically update or delete just about every doc in the 
index, so it makes sense that optimization might be necessary even in 
post 1.4, but I'm still on 1.4 -- add this to another thing to look into 
rather than assume after I upgrade.


Indeed I was aware that it would trigger a pretty complete index 
replication, but, since it seemed to greatly improve performance (in 
1.4), so it goes. But yes, I'm STILL only updating once a day, even with 
all that. (And in fact, I'm only replicating once a day too, ha).


On 7/25/2011 10:50 AM, Erick Erickson wrote:

Yeah, the 1.4 code base is older. That is, optimization will have more
effect on that vintage code than on 3.x and trunk code.

I should have been a bit more explicit in that other thread. In the case
where you add a bunch of documents, optimization doesn't buy you all
that much currently. If you delete a bunch of docs (or update a bunch of
existing docs), then optimization will reclaim resources. So you *could*
have a case where the size of your index shrank drastically after
optimization (say you updated the same 100K documents 10 times then
optimized).

But even that is it depends (tm). The new segment merging, as I remember,
will possibly reclaim deleted resources, but I'm parroting people who actually
know, so you might want to verify that if it

Optimization will almost certainly trigger a complete index replication to any
slaves configured, though.

So the usual advice is to optimize maybe once a day or week during off hours
as a starting point unless and until you can verify that your
particular situation
warrants optimizing more frequently.

Best
Erick

On Fri, Jul 22, 2011 at 11:53 AM, Jonathan Rochkindrochk...@jhu.edu  wrote:

How old is 'older'?  I'm pretty sure I'm still getting much faster performance 
on an optimized index in Solr 1.4.

This could be due to the nature of my index and queries (which include some 
medium sized stored fields, and extensive facetting -- facetting on up to a 
dozen fields in every request, where each field can include millions of unique 
values. Amazing I can do this with good performance at all!).

It's also possible i'm wrong about that faster performance, i haven't done 
robustly valid benchmarking on a clone of my production index yet. But it 
really looks like that way to me, from what investigation I have done.

If the answer is that optimization is believed no longer necessary on versions 
LATER than 1.4, that might be the simplest explanation.

From: Pierre GOSSE [pierre.go...@arisem.com]
Sent: Friday, July 22, 2011 10:23 AM
To: solr-user@lucene.apache.org
Subject: RE: commit time and lock

Hi Mark

I've read that in a thread titled "Weird optimize performance degradation", where Erick Erickson 
states that "Older versions of Lucene would search faster on an optimized index, but this is no longer 
necessary.", and more recently in a thread you initiated a month ago, "Question about 
optimization".

I'll also be very interested if anyone had a more precise idea/datas of 
benefits and tradeoff of optimize vs merge ...

Pierre


-Original Message-
From: Marc SCHNEIDER [mailto:marc.schneide...@gmail.com]
Sent: Friday, July 22, 2011 15:45
To: solr-user@lucene.apache.org
Subject: Re: commit time and lock

Hello,

Pierre, can you tell us where you read that?
I've read here that optimization is not always a requirement to have an
efficient index, due to some low level changes in lucene 3.xx

Marc.

On Fri, Jul 22, 2011 at 2:10 PM, Pierre GOSSEpierre.go...@arisem.comwrote:


Solr will respond to searches during optimization, but commits will have to
wait for the end of the optimization process.

During optimization a new index is generated on disk by merging every
single file of the current index into one big file, so your server will be
busy, especially regarding disk access. This may alter your response time
and have a very negative effect on the replication of the index if you have a
master/slave architecture.

I've read here that optimization is not always a requirement to have an
efficient index, due to some low level changes in lucene 3.xx, so maybe you
don't really need optimization. What version of solr are you using ? Maybe
someone can point toward a relevant link about optimization other than solr
wiki
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

Pierre


-Original Message-
From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
Sent: Friday, July 22, 2011 12:45
To: solr-user@lucene.apache.org
Subject: Re: commit time and lock

Thanks for the clarity.

One more thing I want to know about optimization.

Right now I am planning to optimize the server every 24 hours. Optimization
is also time-consuming (last time it took around 13 minutes), so I want to
know:

1. While optimization is in progress, will the solr server respond or not?
2. If the server will not respond, then how to do 

RE: Re: previous and next rows of current record

2011-07-22 Thread Jonathan Rochkind
  Yes, exactly the same problem I am facing. Is there any way to resolve this issue?

I am not sure what you mean, any way to resolve this issue. Did you read and 
understand what I wrote below? I have nothing more to add.  What is it you're 
looking for?

The way to provide that kind of next/previous is to write application code to 
do it. Although it's not easy to do cleanly in a web app because of the 
sessionless architecture of the web. What are you using for your client 
application?  But honestly I probably have nothing more to say on the topic.


From : Jonathan Rochkind
To : solr-user@lucene.apache.org;
Subject : Re: previous and next rows of current record


I think maybe I know what you mean.

You have a result set generated by a query. You have an item detail page
in your web app -- on that item detail page, you want to give
next/previous buttons for current search results.

If that's it, read on (although news isn't good), if that's not it,
ignore me.

There is no good way to do it. Although it's not really so much a solr
problem.  As far as Solr is concerned, if you know the query, and you
know the current row into the query i, then just ask Solr for
rows=1&start=$(i-1) to get previous, or $(i+1) to get next. (You can't send
$(i-1) or $(i+1) to Solr literally -- that's just shorthand; your app would have to
calculate them and send the literal values.)

The problem is architecting a web app so when you are on an item detail
page, the app knows what the current Solr query was, and what the i
index into it was.

The app I work on wants to provide this feature too, but I am so unhappy
with what it currently does (it is both ugly AND does not actually work
at all right on several very common cases), that I am definitely not
going to provide it as an example.  But if you are willing to have your
web app send the current search and the index in the URL to the item
detail page, that'd certainly make it easier.

It's not so much a Solr problem -- the answer in Solr is pretty clear.
Keep track of what index into your results you are on, and then just ask
for one previous or more.  But there's no great way to make a web app
 that actually does that without horrid urls.  There's nothing built into 
Solr to help you. Solr is pretty much sessionless/stateless, it's got no
idea what the 'current' search for your particular session is.



On 7/21/2011 2:38 PM, Bob Sandiford wrote:
 But - what is it that makes '9' the next id after '5'?  why not '6'?  Or 
 '91238412'? or '4'?

 i.e. you still haven't answered the question about what 'next' and 'previous' 
 really means...

 But - if you already know that '9' is the next page, why not just do another 
 query with id '9' to get the next record?

 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.com


 -Original Message-
 From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
 Sent: Thursday, July 21, 2011 2:33 PM
 To: solr-user@lucene.apache.org
 Subject: Re: previous and next rows of current record

 Hi

 in my case there is no id sequence. id is generated sequence wise for
 all category. but when we filter by category then same id become
 random. If i m on detail page which have id 5 and next id is 9 so on
 same page my requirement is to get next id is 9.

 On Thursday, July 21, 2011, Bob Sandiford
   wrote:
 Well, it sort of depends on what you mean by the 'previous' and the
 'next' record.
 Do you have some type of sequencing built into your concept of your
 solr / lucene indexes?  Do you have sequential id's?
 i.e. What's the use case, and what's the data available to support
 your use case?
 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.com

 -Original Message-
 From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
 Sent: Thursday, July 21, 2011 2:18 PM
 To: solr-user@lucene.apache.org
 Subject: Re: previous and next rows of current record

 Pls help..

 On Thursday, July 21, 2011, Jonty Rhods
 wrote:
 Hi,

 Is there any special query in solr to get the previous and next
 record of the current record. I am getting single record detail using id
 from solr server. I need to get  next and previous on detail page.
 regardsJonty








RE: commit time and lock

2011-07-22 Thread Jonathan Rochkind
How old is 'older'?  I'm pretty sure I'm still getting much faster performance 
on an optimized index in Solr 1.4. 

This could be due to the nature of my index and queries (which include some 
medium sized stored fields, and extensive facetting -- facetting on up to a 
dozen fields in every request, where each field can include millions of unique 
values. Amazing I can do this with good performance at all!). 

It's also possible i'm wrong about that faster performance, i haven't done 
robustly valid benchmarking on a clone of my production index yet. But it 
really looks like that way to me, from what investigation I have done. 

If the answer is that optimization is believed no longer necessary on versions 
LATER than 1.4, that might be the simplest explanation. 

From: Pierre GOSSE [pierre.go...@arisem.com]
Sent: Friday, July 22, 2011 10:23 AM
To: solr-user@lucene.apache.org
Subject: RE: commit time and lock

Hi Mark

I've read that in a thread titled "Weird optimize performance degradation", 
where Erick Erickson states that "Older versions of Lucene would search faster 
on an optimized index, but this is no longer necessary.", and more recently in 
a thread you initiated a month ago, "Question about optimization".

I'll also be very interested if anyone had a more precise idea/datas of 
benefits and tradeoff of optimize vs merge ...

Pierre
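
(If it's useful, here is a rough SolrJ sketch of the two operations being compared; the
URL and the document-building code are placeholders, and the comments only summarize
what's already been said in this thread, so treat it as a sketch, not a recommendation:)

import java.util.Collection;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitVsOptimize {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Collection<SolrInputDocument> docs = buildDocs(); // placeholder for your own documents

        server.add(docs);
        server.commit();    // relatively cheap: flushes pending docs and opens a new searcher

        // expensive: rewrites the whole index into a single segment, heavy on disk I/O,
        // and on a master/slave setup it forces slaves to copy a near-complete index
        server.optimize();
    }

    private static Collection<SolrInputDocument> buildDocs() {
        return java.util.Collections.emptyList(); // stand-in so the sketch compiles
    }
}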


-----Original Message-----
From: Marc SCHNEIDER [mailto:marc.schneide...@gmail.com]
Sent: Friday, July 22, 2011 15:45
To: solr-user@lucene.apache.org
Subject: Re: commit time and lock

Hello,

Pierre, can you tell us where you read that?
I've read here that optimization is not always a requirement to have an
efficient index, due to some low level changes in lucene 3.xx

Marc.

On Fri, Jul 22, 2011 at 2:10 PM, Pierre GOSSE pierre.go...@arisem.com wrote:

 Solr will respond to searches during optimization, but commits will have to
 wait for the end of the optimization process.

 During optimization a new index is generated on disk by merging every
 single file of the current index into one big file, so your server will be
 busy, especially regarding disk access. This may alter your response time
 and has a very negative effect on the replication of the index if you have a
 master/slave architecture.

 I've read here that optimization is not always a requirement to have an
 efficient index, due to some low level changes in lucene 3.xx, so maybe you
 don't really need optimization. What version of solr are you using ? Maybe
 someone can point toward a relevant link about optimization other than solr
 wiki
 http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

 Pierre


 -----Original Message-----
 From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
 Sent: Friday, July 22, 2011 12:45
 To: solr-user@lucene.apache.org
 Subject: Re: commit time and lock

 Thanks for clarity.

 One more thing I want to know about optimization.

 Right now I am planning to optimize the server in 24 hour. Optimization is
 also time taking ( last time took around 13 minutes), so I want to know
 that
 :

 1. when optimization is under process that time will solr server response
 or
 not?
 2. if server will not response then how to do optimization of server fast
 or
 other way to do optimization so our user will not have to wait to finished
 optimization process.

 regards
 Jonty



 On Fri, Jul 22, 2011 at 2:44 PM, Pierre GOSSE pierre.go...@arisem.com
 wrote:

  Solr still respond to search queries during commit, only new indexations
  requests will have to wait (until end of commit?). So I don't think your
  users will experience increased response time during commits (unless your
  server is much undersized).
 
  Pierre
 
  -----Original Message-----
  From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
  Sent: Thursday, July 21, 2011 20:27
  To: solr-user@lucene.apache.org
  Subject: Re: commit time and lock
 
   Actually i m worried about the response time. i m commiting around 500
  docs in every 5 minutes. as i know,correct me if i m wrong; at the
  time of commiting solr server stop responding. my concern is how to
  minimize the response time so user not need to wait. or any other
  logic will require for my case. please suggest.
 
  regards
  jonty
 
  On Tuesday, June 21, 2011, Erick Erickson erickerick...@gmail.com
 wrote:
   What is it you want help with? You haven't told us what the
   problem you're trying to solve is. Are you asking how to
   speed up indexing? What have you tried? Have you
   looked at: http://wiki.apache.org/solr/FAQ#Performance?
  
   Best
   Erick
  
   On Tue, Jun 21, 2011 at 2:16 AM, Jonty Rhods jonty.rh...@gmail.com
  wrote:
   I am using solrj to index the data. I have around 5 docs indexed.
 As
  at
   the time of commit due to lock server stop giving response so I was
   calculating commit time:
  
   double starttemp = System.currentTimeMillis();
   server.add(docs);
   server.commit();
   

Re: Java replication takes slaves down

2011-07-21 Thread Jonathan Rochkind
How often do you replicate? Could it be a too-frequent-commit problem? 
(a replication is a commit to the slave).
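
(For reference, with the Java replication the poll frequency is a slave-side setting in 
solrconfig.xml -- roughly the sketch below, where the master URL and the interval are 
just placeholders; with a ~4.5 GB optimized index you probably don't want to pull it 
very often:)

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <!-- how often the slave polls the master, hh:mm:ss -->
    <str name="pollInterval">00:20:00</str>
  </lst>
</requestHandler>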


On 7/21/2011 4:39 AM, Alexander Valet | edelight wrote:

Hi everybody,

we are using Solr 1.4.1 as our search backend and are replicating (Java based) 
from one master to four slaves.
When our index data grew in size (optimized around 4,5 GB) lately we started 
having huge trouble to spread a new index to
the slaves. They run on 100% CPU and are not able to serve request anymore. We 
have to kill the
Java process to start them again...

Does anybody have a similar experience? Any hints or ideas on how to set up 
proper replication?


Thanks,
Alex






Re: previous and next rows of current record

2011-07-21 Thread Jonathan Rochkind

I think maybe I know what you mean.

You have a result set generated by a query. You have an item detail page 
in your web app -- on that item detail page, you want to give 
next/previous buttons for current search results.


If that's it, read on (although news isn't good), if that's not it, 
ignore me.


There is no good way to do it. Although it's not really so much a solr 
problem.  As far as Solr is concerned, if you know the query, and you 
know the current row into the query i, then just ask Solr for 
rows=1&start=$(i-1) to get previous, or i+1 to get next. (You can't send 
$(i-1) or $(i+1) to Solr that's just short hand, your app would have to 
calculate em and send the literals).


The problem is architecting a web app so when you are on an item detail 
page, the app knows what the current Solr query was, and what the i 
index into it was.


The app I work on wants to provide this feature too, but I am so unhappy 
with what it currently does (it is both ugly AND does not actually work 
at all right on several very common cases), that I am definitely not 
going to provide it as an example.  But if you are willing to have your 
web app send the current search and the index in the URL to the item 
detail page, that'd certainly make it easier.


It's not so much a Solr problem -- the answer in Solr is pretty clear. 
Keep track of what index into your results you are on, and then just ask 
for one previous or more.  But there's no great way to make a web app 
that actually does that without horrid urls.  There's nothing built into 
Solr to help you. Solr is pretty much sessionless/stateless, it's got no 
idea what the 'current' search for your particular session is.
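
(For what it's worth, the Solr side really is just the start/rows arithmetic -- a rough 
SolrJ sketch, with made-up class and field names, assuming the app carries the original 
query string and the current position i to the detail page:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class NextPrevExample {
    // fetch the single record at absolute position 'pos' in the results of query 'q'
    static SolrDocument recordAt(SolrServer server, String q, int pos) throws Exception {
        SolrQuery query = new SolrQuery(q);
        query.setStart(pos);   // pass i-1 for "previous", i+1 for "next"
        query.setRows(1);
        return server.query(query).getResults().get(0);
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int i = 5;  // current position, carried to the detail page by the app
        SolrDocument previous = recordAt(server, "title:monkey", i - 1);
        SolrDocument next = recordAt(server, "title:monkey", i + 1);
        System.out.println(previous.getFieldValue("id") + " / " + next.getFieldValue("id"));
    }
}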




On 7/21/2011 2:38 PM, Bob Sandiford wrote:

But - what is it that makes '9' the next id after '5'?  why not '6'?  Or 
'91238412'? or '4'?

i.e. you still haven't answered the question about what 'next' and 'previous' 
really means...

But - if you already know that '9' is the next page, why not just do another 
query with id '9' to get the next record?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com



-Original Message-
From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
Sent: Thursday, July 21, 2011 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: previous and next rows of current record

Hi

in my case there is no id sequence. id is generated sequence wise for
all category. but when we filter by category then same id become
random. If i m on detail page which have id 5 and next id is 9 so on
same page my requirement is to get next id is 9.

On Thursday, July 21, 2011, Bob Sandiford
bob.sandif...@sirsidynix.com  wrote:

Well, it sort of depends on what you mean by the 'previous' and the

'next' record.

Do you have some type of sequencing built into your concept of your

solr / lucene indexes?  Do you have sequential id's?

i.e. What's the use case, and what's the data available to support

your use case?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com


-Original Message-
From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
Sent: Thursday, July 21, 2011 2:18 PM
To: solr-user@lucene.apache.org
Subject: Re: previous and next rows of current record

Pls help..

On Thursday, July 21, 2011, Jonty Rhods jonty.rh...@gmail.com

wrote:

Hi,

Is there any special query in solr to get the previous and next

record of the current record. I am getting single record detail using id
from solr server. I need to get  next and previous on detail page.

regardsJonty









Re: Determine which field term was found?

2011-07-21 Thread Jonathan Rochkind

I've had this problem too, although never come up with a good solution.

I've wondered, is there any clever way to use the highlighter to 
accomplish tasks like this, or is that more trouble than any help it'll 
get you?


Jonathan

On 7/21/2011 5:27 PM, Yonik Seeley wrote:

On Thu, Jul 21, 2011 at 4:47 PM, Olson, Ron rol...@lbpc.com wrote:

Is there an easy way to find out which field matched a term in an OR query using Solr? I have a 
document with names in two multi-valued fields and I am searching for Smith, using the 
query A_NAMES:smith OR B_NAMES:smith. I figure I could loop through both result arrays, 
but that seems weird to me to have to search again for the value in a result.

That's pretty much the way lucene currently works - you don't know
what fields match a query.
If the query is simple, looping over the returned stored fields is
probably your best bet.

There are a couple other tricks you could use (although they are not
necessarily better):
1) with grouping by query (a trunk feature) you can essentially return
both queries with one request:
   q=*:*&group=true&group.query=A_NAMES:smith&group.query=B_NAMES:smith
   and optionally add a group.query=A_NAMES:smith OR B_NAMES:smith if
you need the combined list
2) use pseudo-fields (also trunk) in conjunction with the termfreq
function (the number of times a term appears in a field).  This
obviously only works with term queries.
   fl=*,count1:termfreq(A_NAMES,'smith'),count2:termfreq(B_NAMES,'smith')
   You can use parameter substitution to pull out the actual term and
simplify the query:
   fl=*,count1:termfreq(A_NAMES,$term),count2:termfreq(B_NAMES,$term)&term=smith


-Yonik
http://www.lucidimagination.com
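
(A rough SolrJ sketch of the "loop over the returned stored fields" option, assuming 
A_NAMES and B_NAMES are stored; the substring check is only an approximation of whatever 
analysis the fields actually use:)

import java.util.Collection;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class WhichFieldMatched {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("A_NAMES:smith OR B_NAMES:smith");
        for (SolrDocument doc : server.query(q).getResults()) {
            for (String field : new String[] { "A_NAMES", "B_NAMES" }) {
                Collection<Object> values = doc.getFieldValues(field); // null if field absent
                if (values == null) continue;
                for (Object v : values) {
                    if (v.toString().toLowerCase().contains("smith")) {
                        System.out.println(doc.getFieldValue("id") + " matched in " + field);
                    }
                }
            }
        }
    }
}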



Re: defType argument weirdness

2011-07-20 Thread Jonathan Rochkind
Huh, I'm still not completely following. I'm sure it makes sense if you 
understand the underlying implemetnation, but I don't understand how 
'type' and 'defType' don't mean exactly the same thing, just need to be 
expressed differently in different location.


Sorry for beating a dead horse, but maybe it would help if you could 
tell me what I'm getting wrong here:


defType can only go in top-level param, and determines the query parser 
for the overall q top level param.


type can only go  in a LocalParam, and determines the query parser that 
applies to whatever query (top-level or nested) that the LocalParam 
syntax lives in.  (Just as any other LocalParams apply only to the query 
that the LocalParam block lives in -- and nested queries inherit their 
query parser from the query they are nested in unless over-ridden, just 
as they inherit every other param from the query they are nested in 
unless over-ridden, nothing special here).


Therefore for instance:

defType=dismax&q=foo

is equivalent to

defType=lucene&q={!type=dismax}foo


Where am I straying in my mental model here? Because if all that is 
true, I don't understand how 'type' and 'defType' mean anything 
different -- they both choose the query parser, do they not? (which to 
me means I wish they were both called 'parser' instead of 'type' -- a 
'type' here is the name of a query parser, is it not?)  It's just that 
if it's in the top-level param you have to use 'defType', and if it's in 
a LocalParam you have to use 'type'.  That's been my mental model, which 
has served me well so far, but if it's wrong and it's going to trip me 
up on some as yet unencountered use cases, it would probably be good for 
me to know it!  (And probably good for some documentation to be written 
somewhere explaining it too). (And if they really are different, 
prefixing def to type is not making it very clear what the 
difference is! What's def supposed to stand for anyway?)


Jonathan


On 7/20/2011 3:49 PM, Chris Hostetter wrote:

: I do understand what they do (at least well enough to use them), but I
: find it confusing that it's called defType as a main param, but type
: in a LocalParam, when to me they both seem to do the same thing -- which

type as a localparam in a query string defines the type of query string
it is -- picking the parser.

defType determines the default value for type in the primary query
string.

: (and then there's 'qt', often confused with defType/type by newbies,
: since they guess it stands for 'query type', but which should probably
: actually have been called 'requestHandler'/'rh' instead, since that's
: what it actually chooses, no?  It gets very confusing).
:
: If it's generally recognized it's confusing and perhaps a somewhat
: inconsistent mental model being implied, I wonder if there'd be any
: interest in renaming these to be more clear, leaving the old ones as
: aliases/synonyms for backwards compatibility (perhaps with a long

qt is historic and already being de-emphasized in favor of using
path based names (ie: http://solr/handlername instead of
http://solr/select?qt=/handlername) so adding yet another alias for that
would be moving in the wrong direction.

type and defType probably make more sense when you think of
them in that order.  I don't see a strong need to confuse/complicate the
issue by adding more aliases for them.



-Hoss



RE: Updating fields in an existing document

2011-07-20 Thread Jonathan Rochkind
Nope, you're not missing anything, there's no way to alter a document in an 
index but reindexing the whole document. Solr's architecture would make it 
difficult (although never say impossible) to do otherwise. But you're right it 
would be convenient for people other than you. 

Reindexing a single document ought not to be slow, although if you have many of 
them at once it could be, or if you end up needing to very frequently commit to 
an index it can indeed cause problems. 

From: Benson Margulies [bimargul...@gmail.com]
Sent: Wednesday, July 20, 2011 6:05 PM
To: solr-user
Subject: Updating fields in an existing document

We find ourselves in the following quandry:

At initial index time, we store a value in a field, and we use it for
facetting. So it, seemingly, has to be there as a field.

However, from time to time, something happens that causes us to want
to change this value. As far as we know, this requires us to
completely re-index the document, which is slow.

It struck me that we can't be the only people to go down this road, so
I write to inquire if we are missing something.


RE: defType argument weirdness

2011-07-19 Thread Jonathan Rochkind
Is it generally recognized that this terminology is confusing, or is it just 
me?  

I do understand what they do (at least well enough to use them), but I find it 
confusing that it's called defType as a main param, but type in a 
LocalParam, when to me they both seem to do the same thing -- which I think 
should probably be called 'queryParser' rather than 'type' or 'defType'.  
That's what they do, choose the query parser for the query they apply to, 
right?  (And if they did/do different things, 'defType' vs 'type' doesn't 
really provide much hint as to what!)

These are both the same, right, but with different param names depending on 
position:
defType=lucene&q=foo
q={!type=lucene}foo  # uri escaping not shown

(and then there's 'qt', often confused with defType/type by newbies, since they 
guess it stands for 'query type', but which should probably actually have been 
called 'requestHandler'/'rh' instead, since that's what it actually chooses, 
no?  It gets very confusing). 

If it's generally recognized it's confusing and perhaps a somewhat inconsistent 
mental model being implied, I wonder if there'd be any interest in renaming 
these to be more clear, leaving the old ones as aliases/synonyms for backwards 
compatibility (perhaps with a long deprecation period, or perhaps existing 
forever). I know it was very confusing to me to keep track of these parameters 
and what they did for quite a while, and still trips me up from time to time. 

Jonathan

From: ysee...@gmail.com [ysee...@gmail.com] on behalf of Yonik Seeley 
[yo...@lucidimagination.com]
Sent: Tuesday, July 19, 2011 9:40 PM
To: solr-user@lucene.apache.org
Subject: Re: defType argument weirdness

On Tue, Jul 19, 2011 at 1:25 PM, Naomi Dushay ndus...@stanford.edu wrote:
 Regardless, I thought that defType=dismax&q=*:*   is supposed to be
 equivalent to  q={!defType=dismax}*:*  and also equivalent to q={!dismax}*:*

Not quite - there is a very subtle distinction.

{!dismax}  is short for {!type=dismax}, the type of the actual query,
and this may not be overridden.

The defType local param is only the default type for sub-queries (as
opposed to the current query).
It's useful in conjunction with the query  or nested query qparser:
http://lucene.apache.org/solr/api/org/apache/solr/search/NestedQParserPlugin.html

-Yonik
http://www.lucidimagination.com
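
(To make that distinction concrete -- roughly, and with made-up field and parameter names;
the last line uses the nested-query qparser linked above, so treat it as a sketch:

defType=dismax&q=solr rocks                      <- top-level param: parser for the main q
q={!dismax qf=title}solr rocks                   <- local "type"/shorthand: parser for this query string itself
q={!query defType=dismax v=$qq}&qq=solr rocks    <- local "defType": default parser for the nested $qq sub-query
)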


Re: NRT and commit behavior

2011-07-18 Thread Jonathan Rochkind
In practice, in my experience at least, a very 'expensive' commit can 
still slow down searches significantly, I think just due to CPU (or 
i/o?) starvation. Not sure anything can be done about that.  That's my 
experience in Solr 1.4.1, but since searches have always been async with 
commits, it probably is the same situation even in more recent versions, 
I'd guess.


On 7/18/2011 11:07 AM, Yonik Seeley wrote:

On Mon, Jul 18, 2011 at 10:53 AM, Nicholas Chasench...@earthlink.net  wrote:

Very glad to hear that NRT is finally here!  But my question is this: will
things still come to a standstill during a commit?

New updates can now proceed in parallel with a commit, and
searches have always been completely asynchronous w.r.t. commits.

-Yonik
http://www.lucidimagination.com



RE: Uninstall Solr

2011-07-01 Thread Jonathan Rochkind
There's no general documentation on that, because it depends on exactly what 
container you are using (Tomcat? Jetty? Something else?) and how you are using 
it.  It is confusing, but blame Java for that, nothing unique to Solr. 

So since there's really nothing unique to Solr here, you could try looking up 
documentation on the particular container you are using and how you undeploy 
.war's from it, or asking on lists related to that documentation. 

But it's also possible someone here would be able to help you out, but you'd 
have to provide more information about what container you are using, and 
ideally what you did in the first place to install it. 

Jonathan

From: gauravpareek2...@gmail.com [gauravpareek2...@gmail.com]
Sent: Friday, July 01, 2011 4:41 AM
To: erik.hatc...@gmail.com; solr-user@lucene.apache.org
Subject: Re: Uninstall Solr

Hello Erik,
thank u for ur help.

I understand that we need to delete the folder, but how do I undeploy the solr.war 
and where can I find it?
If anyone can send me the document to uninstall the solr software, that will be great.

Regards,
Gaurav Pareek
--
Sent via Nokia Email

--Original message--
From: Erik Hatcher erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Date: Thursday, June 30, 2011 8:10:48 PM GMT-0400
Subject: Re: Uninstall Solr

How'd you install it?

Generally you just delete the directory where you installed it.  But you 
might be deploying solr.war in a container somewhere besides Solr's example 
Jetty setup, in which case you need to undeploy it from those other containers 
and remove the remnants.

Curious though... why uninstall it?  Solr makes a mighty fine hammer to have 
around :)

Erik

On Jun 30, 2011, at 19:49 , GAURAV PAREEK wrote:

 Hi All,

 How to *uninstall* Solr completely ?

 Any help will be appreciated.

 Regards,
 Gaurav




Re: Index Version and Epoch Time?

2011-06-28 Thread Jonathan Rochkind

On 6/28/2011 1:38 PM, Pranav Prakash wrote:

- Will the commit by incremental indexer script also commit the
previously uncommitted changes made by full indexer script before it broke?


Yes, as long as the Solr instance hasn't crashed.  Anything added but 
not yet committed sticks around and will be committed on next 'commit'. 
There are no 'transactions' for adding docs in Solr, even if multiple 
processes are adding; if any one of them issues a 'commit' they'll all be 
committed.



Sometimes, while during execution, Solr's avg response time (avg resp time
for last 10 requests, read from log file) goes as high as 9000ms (which I am
still unclear why, any ideas how to start hunting for the problem?),


It could be a Java garbage collection issue. I have found it useful to 
start the JVM with Solr in it using some parameters to tune garbage 
collection. I use these JVM options:
 -server -XX:+AggressiveOpts -d64 -XX:+UseConcMarkSweepGC 
-XX:+UseCompressedOops


You've still got to make sure Solr has enough memory for what you're 
doing with it, with with your 5 million doc index might be more than you 
expect. On the other hand, giving a JVM too _much_ heap can cause 
slowdowns too, although I think the -XX:+UseConcMarkSweepGC should 
ameliorate that to some extent.


Possibly more likely, it could instead be Solr readying the new indexes. 
Do you issue commits in the middle of 'execution', and could the 
slowdown happen right after a commit?  When a commit is issued to Solr, 
Solr's got to switch new indexes in with the newly added documents, and 
'warm' those indexes in various ways. Which can be a CPU (as well as 
RAM) intensive thing. (For these purposes a replication from master 
counts as a commit (because it is), and an optimize can count too 
(because it's close enough)).


This can be especially a problem if you issue multiple commits very 
close together -- Solr's still working away at readying the index from 
the first commit, when the second comes in, and now Solr's trying to get 
ready two indexes at once (one of which will never be used because it's 
already outdated).  Or even more than two if you issue a bunch of 
commits in rapid succession.






  I found that the uncommitted changes were
applied and searchable. However, the updates were uncommitted.


There is in general no way that uncommitted adds could be searchable, 
that's probably not happening.   What is probably happening instead is 
that a commit _is_ happening.  One way a commit can happen even if you 
aren't manually issuing one is with various auto-commit settings in 
solrconfig.xml.  Commit any pending adds after X documents, or after T 
seconds, can both be configured. If they are configured, that could be 
causing commits to happen when you don't realize it, which could also 
trigger the slowdown due to a commit mentioned in the previous paragraph.
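
(For reference, those auto-commit settings live in solrconfig.xml and look roughly like 
this; the numbers are only illustrative:)

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>   <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>   <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>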


Jonathan



Re: moving to multicore without changing existing index

2011-06-28 Thread Jonathan Rochkind
Nope. But you can move your existing index into a core in a multi-core 
setup.  But a multi-core setup is a multi-core setup, there's no way to 
have an index accessible at a non-core URL in a multi-core setup.
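
(If it helps, a minimal solr.xml for that kind of multi-core setup looks roughly like this 
-- core names and instanceDirs are placeholders, and your existing index's conf/ and data/ 
would move under one core's instanceDir:)

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="existing" instanceDir="existing"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>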


On 6/28/2011 2:53 PM, lee carroll wrote:

hi
I'm looking at setting up multi core indices but also have an existing
index. Can I run
this index along side new index set up as cores. On a dev  machine
I've experimented with
simply adding solr.xml in solr home and listing the new cores in the
cores element but this breaks the existing
index.

container is tomcat and attempted set up was:

solrHome
 conf (existing running index)
 core1 (new core directory)
 solr.xml (cores element has one entry for core1)

Is this a valid approach ?

thanks lee



Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-22 Thread Jonathan Rochkind

Yeah, I see your points. It's complicated. I'm not sure either.

But the thing is:

 in order to use a feature like that you'd have to really think hard 
about

 the query analysis of your fields, and which ones will produce which
 tokens in which situations

You need to think really hard about the (index and query) analysis of 
your fields and which ones will produce which tokens _now_, if you are 
using multiple fields in a 'qf' with differing analysis, and using a 
percent mm. (Or similarly an mm that varies depending on how many terms).


That's what I've come to realize, that's the status quo. If your qf 
fields don't all have identical analysis, right _now_ you need to think 
really hard about the analysis and how it's going to possibly affect 
'mm', including for edge case queries.  If you don't, you likely have 
edge case queries (at least) which aren't behaving how you expected 
(whether you notice or have it brought to your attention by users or not).


Or you can just make sure all fields in your qf have identical analysis, 
and then you don't have to worry about it. But that's not always 
practical, a lot of the power of dismax qf ends up being combining 
fields with different analysis.


So I was trying to think of a way to make this less so, but still be 
able to take advantage of dismax, but I think you're right that maybe 
there isn't any, or at least nothing we've come up with yet.


Maybe what I really need is a query parser that does not do disjunction 
maximum at all, but somehow still combines different 'qf' type fields 
with different boosts on each field. I personally don't _neccesarily_ 
need the actual disjunction max calculation, but I do need combining 
of multiple fields with different boosts. Of course, I'm not sure exactly 
how it would combine multiple fields if not disjunction maximum, but 
perhaps one is conceivable that wouldn't be subject to this particular 
gotcha with differing analysis.


I also remain kind of confused about how the existing dismax figures out 
how many terms for the 'mm' type calculations. If someone wanted to 
explain that,  I would find it enlightening and helpful for 
understanding what's going on.


Jonathan

On 6/21/2011 10:20 PM, Chris Hostetter wrote:

: not other) setups/intentions.  It's counter-intuitive to me that adding
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set

agreed .. but that's where looking the debug info comes in to understand
the reason for that behavior is that your old qf treated part of your
input as garbage and that new field respects it and uses it in the
calculation.

mind you: the fewer hits behavior only happens when using a percentage
value in mm ... if you had mm=2 you'd get more results, but you've asked
for 66% (or whatever) and with that new qf there is a differnet number
of clauses produced by query parsing.

: I wonder if it would be a good idea to have a parameter to (e)dismax
: that told it which of these two behaviors to use? The one where the
: 'term count' is based on the maximum number of terms from any field in
: the 'qf', and one where it's based on the minimum number of terms
: produced from any field in the qf?  I am still not sure how feasible

even in your use case, i don't think you are fully considering what that
would produce.  imagine that an mmType=min param existed and gave you what
you're asking for.  Now imagine that you have two fields, one named
simple that strips all punctuation and one named complex that doesn't,
and you have a query like this...

q=Foo & Bar
qf=simple complex
mm=100%
mmType=min

   * Foo produces tokens for all qf
   * & only produces tokens for some qf (complex)
   * Bar produces tokens for all qf

your mmType would say there are only 2 tokens that we can query across
all fields, so our computed minShouldMatch should be 100% of 2 == 2

sounds good so far right?

the problem is you still have a query clause coming from that &
character ... you have 3 real clauses, one of which is that term query for
complex:& which means that with your (computed) minShouldMatch of 2 you
would see matches for any doc that happened to have indexed the & symbol
in the complex field and also matched *either* of Foo or Bar (in either
field)

So while a lot of your results would match both Foo and Bar, you'd get
still get a bunch of weird results.

: Or maybe a feature where you tell dismax, the number of tokens produced
: by field X, THAT's the one you should use for your 'term count' for mm,

Hmmm maybe.  i'd have to see a patch in action and play with it, to
really think it through ... hmmm ... honestly i really can't imagine how
that would be helpful in general...

in order to use a feature like that you'd have to really think hard about
the query analysis of your fields, and which ones will produce which
tokens in which situations in order to make sure you pick the *right*
value for that param -- but once you've done that hard 

Re: MultiValued facet behavior question

2011-06-22 Thread Jonathan Rochkind
Okay, so since you put cardiologist in the 'q', you only want facet 
values that have 'cardiologist' (or 'Cardiologist') to show up in the 
facet list.


In general, there's no good way to do that.

But.

If you want to do some client-side processing before you submit the 
query to Solr, and on the client side you can figure out exactly what 
you want: then you could try to play around with facet.filter or 
facet.query, to see if you can make it do what you want. It may or may 
not work out, depending on exactly your use pattern, which you still 
haven't articulated very well, but you can mess around with it and see 
what you can do.


Ie, if you KNOW (that is, your own app code knows, when creating the 
Solr request) that you only want the facet value for Cardiologist 
(including exact case), you can try facet.query=specialty:Cardiologist


Your app code would have to pull out the results special too, they won't 
be in the Solr response in same way ordinary facet.field is. It also 
requires your query value to match _exactly_ (case, punctuation, etc) 
the value in the index. Not cardiologist and Cardiologist.


I think Solr 3.1 has some regex based facet.filter abilities that might 
be useful, and help you get around the 'exact match' issues, but watch 
out for performance.
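
(Concretely, something along these lines -- note the count for the facet.query comes back 
under facet_queries rather than facet_fields, so your app has to read it from there:

q=cardiologist&defType=dismax&qf=specialties&facet=true&facet.query=specialties:Cardiologist
)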





On 6/21/2011 11:37 PM, Bill Bell wrote:

Doing it with q=specialities:Cardiologist or
q=Cardiologist&defType=dismax&qf=specialties
does not matter, the issue is how I see facets. I want the facets to only
show the one match,
and not all the multiValued fields in specialties that match...

Example,

Name|specialties
Bell|Cardiologist
Smith|Cardiologist,Family Doctor
Adams|Cardiologist,Family Doctor,Internist

When I facet.field=specialties I get:

Cardiologist: 3
Internist: 1
Family Doctor: 1


I only want it to return:

Cardiologist: 3

Because this matches exactly... Facet on the field that matches and only
return the number for that.

It can get more complicated. Here is another example:

q=cardiology&defType=dismax&qf=specialties


(Cardiology and cardiologist are stems)...

But I don't really know which value in Cardiologist match perfectly.

Again, I only want it to return:

Cardiologist: 3

If I searched on q=internist&defType=dismax&qf=specialties, I want the
result to be:


Internist: 1


Does this all make sense?







On 6/21/11 8:23 PM, Darren Govoni dar...@ontrenet.com wrote:


So are you saying that for all results for cardiologist,
you don't want facets not matching Cardiologist to be
returned as facets?

what happens when you make q=specialities:Cardiologist?
instead of just q=Cardiologist?

Seems that if you make the query on the field, then all
your results will necessarily qualify and you can discard
any additional facets you don't want (e.g. that don't
match the initial query term).

Maybe you can write what you see now, with what you
want to help clarify.

On 06/21/2011 09:47 PM, Bill Bell wrote:

I have a field: specialties that is multiValued.

It indicates the doctor's specialties: cardiologist, internist, etc.

When someone does a search: Cardiologist, I use

q=cardiologist&defType=dismax&qf=specialties&facet=true&facet.field=specialties

What I want to come out in the facet is the Cardiologist (since it
matches
exactly) and the number that matches: 700.
I don't want to see the other values that are not Cardiologist.

Now I see:

Cardiologist: 700
Internist: 45
Family Doctor: 20

This means that several Cardiologist's are also internists and family
doctors. When it matches exactly, I don't want to see Internists, Family
Doctors. How do I send a query to Solr with a condition.
facet.query=specialties:Cardiologist&facet.field=specialties

Then if the query returns something use it, otherwise use the field one?

Other ideas?









RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-21 Thread Jonathan Rochkind
Thanks, that's helpful. 

It still seems like current behavior does the wrong thing in _many_ cases (I 
know a lot of people get tripped up by it, sometimes on this list) -- but I 
understand your cases where it does the right thing, and where what I'm 
suggesting would be the wrong thing. 

 Ultimately the problem you had with  is the same problem people have 
 with stopwords, and comes down to the same thing: if you don't want some 
 chunk of text to be significant when searchng a field in your qf, have 
 your analyzer remove it 

Ah, but see the problem people have with stopwords is when they actually DID 
that. They didn't want a term to be 'significant' in one field, but they DID 
want it to be 'significant' in another field... but how this effects the 'mm' 
ends up being kind of counter-intuitive for some (but not other) 
setups/intentions.   It's counter-intuitive to me that adding a field to the 
'qf' set results in _fewer_ hits than the same 'qf' set without the new field 
-- although I understand your cases where you added the field to the 'qf' 
precisely in order to intentionally get that behavior, that's definitely not a 
universal case. 

And the fact that unpredictable changes to field analysis that aren't as simple 
as stopwords can lead to this same problem (as in this case where one field 
ignores punctuation and the other doesn't) -- it's definitely a trap waiting 
for some people. 

I wonder if it would be a good idea to have a parameter to (e)dismax that told 
it which of these two behaviors to use? The one where the 'term count' is based 
on the maximum number of terms from any field in the 'qf', and one where it's 
based on the minimum number of terms produced from any field in the qf?  I am 
still not sure how feasible THAT is, but it seems like a good idea to me. The 
current behavior is definitely a pitfall for many people.  

Or maybe a feature where you tell dismax, the number of tokens produced by 
field X, THAT's the one you should use for your 'term count' for mm, all the 
other fields are really just in there as sort of supplementary -- for boosting, 
or for bringing a few more results in; but NOT the case where you intentionally 
add a 'qf' with KeepWordsFilter in order to intentionally _reduce_ the result 
set . I think that's a pretty common use case too. 

Jonathan


Re: getting started

2011-06-16 Thread Jonathan Rochkind

On 6/16/2011 4:41 PM, Mari Masuda wrote:

One reservation I have is that eventually we would like to be able to type in Iraq and 
find records across all of the collections at once instead of having to search each collection 
separately.  Although I don't know anything about it at this stage, I did Google 
sharding after reading someone's recent post on this list and it sounds like that may 
be a potential answer to my question.


So this kind of stuff can be tricky, but with that eventual requirement 
I would NOT put these in separate cores. Sharding isn't (IMO, if someone 
disagrees, they will hopefully say so!) a good answer to searching 
across entirely different 'schemas', or avoiding frequent-commit issues 
-- sharding is really just for scaling/performance when your index gets 
very very large. (Which it doesn't sound like yours will be, but you can 
deal with that as a separate issue if it becomes so).


If you're going to want to search across all the collections, put them 
all in the same core.  Either in the exact same indexed fields, or using 
certain common indexed fields -- those common ones are the ones you'll 
be able to search across all collections on. It's okay if some 
collections have unique indexed fields too --- documents in the core 
that don't belong to that collection just won't have any terms in that 
indexed field that is only used by a certain collection, no problem. 
(Then you can distribute this single core into shards if you need to for 
performance reasons related to number of documents/size of index).


You're right to be thinking about the fact that very frequent commits 
can be performance issues in Solr. But separating in different cores is 
going to create more problems for yourself (if you want to be able to 
search across all collections), in an attempt to solve that one.  
(Among other things, not every Solr feature works in a 
distributed/sharded environment, it's just a more complicated and 
somewhat less mature setup for Solr).


The way I deal with the frequent-commit issue is by NOT doing frequent 
commits to my production Solr. Instead, I use Solr replication to have a 
'master' Solr index that I do commits to whenever I want, and a 'slave' 
Solr index that serves the production searches, and which only 
replicates from master periodically -- not too often to be 
too-frequent-commits.  That seems to be a somewhat common solution, if 
that use pattern works for you.


There are also some near real time features in more recent versions of 
Solr, that I'm not very familiar with. (not sure if any are included in 
the current latest release, or if they are all only still in the repo)  
My sense is that they too only work for certain use patterns, they 
aren't magic bullets for commit whatever you want as often as you want 
to Solr.  In general Solr isn't so great at very frequent major changes 
to the index.   Depending on exactly what sort of use pattern you are 
predicting/planning for your commits, maybe people can give you advice 
on how (or if) to do it.


But I personally don't think your idea of splitting your collections 
(that you'll eventually want to search across into a single search) 
into shards is a good solution to frequent-commit issues. You'd be 
complicating your setup and causing other problems for yourself, and not 
really even entirely addressing the too-frequent-commit issue with that 
setup.


Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-15 Thread Jonathan Rochkind
Okay, I figured this one out -- I'm participating in a thread with 
myself here, but for benefit of posterity, or if anyone's interested, 
it's kind of interesting.


It's actually a variation of the known issue with dismax, mm, and fields 
with varying stopwords. Actually a pretty tricky problem with dismax, 
which it's now clear goes way beyond just stopwords.


http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

So to understand, first familiarize yourself with that.

However, none of the fields involved here had any stopwords at all, so 
at first it wasn't obvious this was the problem. But having different 
tokenization and other analysis between fields can result in exactly the 
same problem, for certain queries.


One field in the dismax qf used an analyzer that stripped punctuation. 
(I'm actually not positive at this point _which_ analyzer in my chain 
was stripping punctuation, I'm using a bunch including some custom ones, 
but I was aware that punctuation was being stripped, this was intentional.)


So monkey's turns into monkey.  monkey: turns into monkey.  So 
far so good. But what happens if you have punctuation all by itself 
separated by whitespace?  Roosevelt & Churchill turns into 
['roosevelt', 'churchill'].  That ampersand in the middle was stripped 
out, essentially _just as if_ it were a stopword. Only two tokens result 
from that input.


You can see where this is going -- another field involved in the dismax 
qf did NOT strip out punctuation. So three tokens result from that 
input, ['Roosevelt', '&', 'Churchill'].


Now we have exactly the situation that gives ride the dismax stopwords 
mm-behaving-funny situation, it's exactly the same thing.


Now I've fixed this for punctuation just by making those fields strip 
out punctuation, by adding these analyzers to the bottom of those 
previously-not-stripping-punctuation field definitions:


<!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([\p{Punct}])" replacement="" replace="all"/>
<!-- if after stripping punc we have any 0-length tokens, make sure to
     eliminate them. We can use LengthFilter min=1 for that; we don't
     care about the max here, just a very large number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>


And things are working are how I expect again, at least for this 
punctuation issue. But there may be other edge cases where differences 
in analysis result in different number of tokens from different fields, 
which if they are both included in a dismax qf, will have bad effects on 
'mm'.


The lesson I think, is that the only absolute safe way to use dismax 
'mm', is when all fields in the 'qf' have exactly the same analysis.  
But obviously that's not very practical, it destroys much of the power 
of dismax. And some differences in analysis are certainly acceptable -- 
but it's rather tricky to figure out if your differences in analysis are 
going to be significant for this problem, under what input, and if so 
fix them. It is not an easy thing to do.  So dismax definitely has this 
gotcha potentially waiting for you, whenever mixing fields with 
different analysis in a 'qf'.



On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:

Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, that's ordinary text tokenized, and does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf= 



<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add in a second field to the qf, this 
second field is also ordinarily tokenized. We expect no _fewer_ than 
25 hits, adding another field into qf, right? And indeed it still 
results in exactly 25 hits (no additional hits from the additional qf 
field).


?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf= 



<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title2_t:roosevelt | 
title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | 
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, 
but add a second field in that uses KeywordTokenizerFactory.  We 
expect this not necessarily to ever match for a multi-word query, but 
we don't expect it to be fewer than 25 hits, the 25 hits from the 
first field in the qf should still be there, right? But it's not. What 
happened, why not?

Re: Multiple indexes

2011-06-15 Thread Jonathan Rochkind
Next, however, I predict you're going to ask how you do a 'join' or 
otherwise query accross both these cores at once though. You can't do 
that in Solr.


On 6/15/2011 1:00 PM, Frank Wesemann wrote:

You'll configure multiple cores:
http://wiki.apache.org/solr/CoreAdmin

Hi.

How to have multiple indexes in SOLR, with different fields and
different types of data?

Thank you very much!
Bye.





Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-15 Thread Jonathan Rochkind
 different fields, which if they
are both included in a dismax qf, will have bad effects on 'mm'.

The lesson I think, is that the only absolute safe way to use dismax 'mm',
is when all fields in the 'qf' have exactly the same analysis.  But
obviously that's not very practical, it destroys much of the power of
dismax. And some differences in analysis are certainly acceptable -- but
it's rather tricky to figure out if your differences in analysis are going
to be significant for this problem, under what input, and if so fix them. It
is not an easy thing to do.  So dismax definitely has this gotcha
potentially waiting for you, whenever mixing fields with different analysis
in a 'qf'.


On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:

Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, that's ordinary text tokenized, and does get hits:


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01)
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add in a second field to the qf, this second
field is also ordinarily tokenized. We expect no _fewer_ than 25 hits,
adding another field into qf, right? And indeed it still results in exactly
25 hits (no additional hits from the additional qf field).


?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, but
add a second field in that uses KeywordTokenizerFactory.  We expect this not
necessarily to ever match for a multi-word query, but we don't expect it to
be fewer than 25 hits, the 25 hits from the first field in the qf should
still be there, right? But it's not. What happened, why not?


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=


<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill |
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
title1_t:roosevelt)~0.01)~3) ()</str>



On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:

I'm aware that using a field tokenized with KeywordTokenizerFactory in
a dismax 'qf' is often going to result in 0 hits on that field -- (when a
whitespace-containing query is entered).  But I do it anyway, for cases
where a non-whitespace-containing query is entered, then it hits.  And in
those cases where it doesn't hit, I figure okay, well, the other fields in
qf will hit or not, that's good enough.

And usually that works. But it works _differently_ when my query contains
an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't,
and I can't figure out why.

basically,

defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The : is thrown out the text_field, but the mm still passes
somehow, right?

But, in the same index:

defType=dismax&mm=100%&q=one : two&qf=text_field
keyword_tokenized_text_field

gets 0 hits.  Somehow maybe the inclusion of the
keyword_tokenized_text_field in the qf causes dismax to calculate the mm
differently, decide there are three tokens in there and they all must match,
and the token : can never match because it's not in my index it's stripped
out... but somehow this isn't a problem unless I include a keyword-tokenized
  field in the qf?

This is really confusing, if anyone has any idea what I'm talking about
it and can shed any light on it, much appreciated.

The conclusion I am reaching is just NEVER include anything but a more or
less ordinarily tokenized field in a dismax qf. Sadly, it was useful for
certain use cases for me.

Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt |
title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:churchill
roosevelt~3^240.0 | text:churchil roosevelt~3^10.0 | title2_t:churchil
roosevelt~3^50.0 | author_unstem:churchill roosevelt~3^400.0 |
title_exactmatch:churchill roosevelt^500.0

ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-14 Thread Jonathan Rochkind
I'm aware that using a field tokenized with KeywordTokenizerFactory 
in a dismax 'qf' is often going to result in 0 hits on that field -- 
(when a whitespace-containing query is entered).  But I do it anyway, 
for cases where a non-whitespace-containing query is entered, then it 
hits.  And in those cases where it doesn't hit, I figure okay, well, the 
other fields in qf will hit or not, that's good enough.


And usually that works. But it works _differently_ when my query 
contains an ampersand (or any other punctuation), resulting in 0 hits when 
it shouldn't, and I can't figure out why.


basically,

defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The : is thrown out the text_field, but the mm still 
passes somehow, right?


But, in the same index:

defType=dismax&mm=100%&q=one : two&qf=text_field 
keyword_tokenized_text_field


gets 0 hits.  Somehow maybe the inclusion of the 
keyword_tokenized_text_field in the qf causes dismax to calculate the mm 
differently, decide there are three tokens in there and they all must 
match, and the token : can never match because it's not in my index 
it's stripped out... but somehow this isn't a problem unless I include a 
keyword-tokenized  field in the qf?


This is really confusing, if anyone has any idea what I'm talking about 
it and can shed any light on it, much appreciated.


The conclusion I am reaching is just NEVER include anything but a more 
or less ordinarily tokenized field in a dismax qf. Sadly, it was useful 
for certain use cases for me.


Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
DisjunctionMaxQuery((title2_unstem:churchill roosevelt~3^240.0 | 
text:churchil roosevelt~3^10.0 | title2_t:churchil roosevelt~3^50.0 
| author_unstem:churchill roosevelt~3^400.0 | 
title_exactmatch:churchill roosevelt^500.0 | title1_t:churchil 
roosevelt~3^60.0 | title1_unstem:churchill roosevelt~3^320.0 | 
author2_unstem:churchill roosevelt~3^240.0 | title3_unstem:churchill 
roosevelt~3^80.0 | subject_t:churchil roosevelt~3^10.0 | 
other_number_unstem:churchill roosevelt~3^40.0 | 
subject_unstem:churchill roosevelt~3^80.0 | title_series_t:churchil 
roosevelt~3^40.0 | title_series_unstem:churchill roosevelt~3^60.0 | 
text_unstem:churchill roosevelt~3^80.0)~0.01)

</str>
<str name="parsedquery_toString">
+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 
(isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) 
(title2_unstem:churchill roosevelt~3^240.0 | text:churchil 
roosevelt~3^10.0 | title2_t:churchil roosevelt~3^50.0 | 
author_unstem:churchill roosevelt~3^400.0 | title_exactmatch:churchill 
roosevelt^500.0 | title1_t:churchil roosevelt~3^60.0 | 
title1_unstem:churchill roosevelt~3^320.0 | author2_unstem:churchill 
roosevelt~3^240.0 | title3_unstem:churchill roosevelt~3^80.0 | 
subject_t:churchil roosevelt~3^10.0 | other_number_unstem:churchill 
roosevelt~3^40.0 | subject_unstem:churchill roosevelt~3^80.0 | 
title_series_t:churchil roosevelt~3^40.0 | 
title_series_unstem:churchill roosevelt~3^60.0 | 
text_unstem:churchill roosevelt~3^80.0)~0.01

</str>
<lst name="explain"/>
<str name="QParser">DisMaxQParser</str>
<null name="altquerystring"/>
<null name="boostfuncs"/>
<lst name="timing">
  <double name="time">6.0</double>
  <lst name="prepare">
    <double name="time">3.0</double>
    <lst name="org.apache.solr.handler.component.QueryComponent">
      <double name="time">2.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.FacetComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.HighlightComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.StatsComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.SpellCheckComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.DebugComponent">
      <double name="time">0.0</double>
    </lst>
  </lst>
</lst>
</lst>




Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-14 Thread Jonathan Rochkind

Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, an ordinarily tokenized text field, and it does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add a second field to the qf; this 
second field is also ordinarily tokenized. We expect no _fewer_ than 25 
hits from adding another field into qf, right? And indeed it still results 
in exactly 25 hits (no additional hits from the additional qf field).

?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | 
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, 
but add a second field that uses KeywordTokenizerFactory.  We expect 
this not necessarily to ever match for a multi-word query, but we don't 
expect it to produce fewer than 25 hits; the 25 hits from the first field 
in the qf should still be there, right? But it's not. What happened, why not?


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=


<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | 
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill | 
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | 
title1_t:roosevelt)~0.01)~3) ()</str>
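
Putting the two parsed queries side by side, the difference seems to be 
the extra clause dismax generates for the lone ':' chunk once isbn_t is 
in qf (these are just hand-drawn skeletons of the output above, not 
literal traces):

qf=title1_t title2_t  =>  +(( churchill-clause  roosevelt-clause )~2)                 two clauses; mm=100% needs 2
qf=title1_t isbn_t    =>  +(( churchill-clause  (isbn_t::)  roosevelt-clause )~3)     three clauses; mm=100% needs 3

Since (isbn_t::) can never match anything in the index, the second form 
can never satisfy mm=100%, which would explain the 0 hits.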




On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
I'm aware that using a field tokenized with KeywordTokenizerFactory in 
a dismax 'qf' is often going to result in 0 hits on that field -- 
(when a whitespace-containing query is entered).  But I do it anyway, 
for cases where a non-whitespace-containing query is entered, then it 
hits.  And in those cases where it doesn't hit, I figure okay, well, 
the other fields in qf will hit or not, that's good enough.


And usually that works. But it works _differently_ when my query 
contains an ampersand (or any other punctuation), resulting in 0 hits 
when it shouldn't, and I can't figure out why.


basically,

defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The ':' is thrown out by the text_field analysis, but the mm 
still passes somehow, right?


But, in the same index:

defType=dismax&mm=100%&q=one : two&qf=text_field 
keyword_tokenized_text_field


gets 0 hits.  Somehow maybe the inclusion of the 
keyword_tokenized_text_field in the qf causes dismax to calculate the 
mm differently, decide there are three tokens in there and they all 
must match, and the token ':' can never match because it's not in my 
index (it's stripped out)... but somehow this isn't a problem unless I 
include a keyword-tokenized field in the qf?


This is really confusing, if anyone has any idea what I'm talking 
about and can shed any light on it, much appreciated.


The conclusion I am reaching is just NEVER include anything but a more 
or less ordinarily tokenized field in a dismax qf. Sadly, it was 
useful for certain use cases for me.


Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
  <str name="rawquerystring">churchill : roosevelt</str>
  <str name="querystring">churchill : roosevelt</str>
  <str name="parsedquery">
    +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
    DisjunctionMaxQuery((isbn_t::)~0.01)
    DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
    DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 |
    text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 |
    author_unstem:"churchill roosevelt"~3^400.0 |
    title_exactmatch:"churchill roosevelt"^500.0 |
    title1_t:"churchil roosevelt"~3^60.0 |
    title1_unstem:"churchill roosevelt"~3^320.0 |
    author2_unstem:"churchill roosevelt"~3^240.0 |
    title3_unstem:"churchill roosevelt"~3^80.0 |
    subject_t:"churchil roosevelt"~3^10.0 |
    other_number_unstem:"churchill roosevelt"~3^40.0 |
    subject_unstem:"churchill roosevelt"~3^80.0 |
    title_series_t:"churchil roosevelt"~3^40.0 |
    title_series_unstem:"churchill roosevelt"~3^60.0 |
    text_unstem:"churchill roosevelt"~3^80.0)~0.01)
  </str>
  <str name="parsedquery_toString">
    +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
    (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3)
    (title2_unstem:"churchill roosevelt"~3^240.0 |
    text:"churchil roosevelt"~3^10.0

Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Jonathan Rochkind
Um, normally that would never happen, because, well, like you say, the 
inverted index doesn't have docC for term K1, because doc C didn't 
include term K1.


If you search on q=K1, then how/why would docC ever be in your result 
set?  Are you seeing it in your result set? The question then would be 
_why_, what weird thing is going on to make that happen; that's not 
expected.


The result set _starts_ from only the documents that actually include 
the term.  Boosting/relevancy ranking only affects what order these 
documents appear in, but there's no reason documentC should be in the 
result set at all in your case of q=k1, where docC is not indexed under k1.
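
If some other feature (a boost query, synonym expansion, etc.) really is 
pulling extra documents in, one blunt way to hard-require the term is to 
repeat it in a filter query, something like this (the field name here is 
hypothetical):

q=k1&fq=content:k1

But for a plain q=k1 against a normal index, that shouldn't be necessary.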


On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:

Sorry for being unclear, and thank you for answering.
Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3),
where A, B, C are document identifiers and the k's in brackets are the terms
each contains.
So Solr's inverted index should be something like:

k0 -->  A | C
k1 -->  A | B
k2 -->  A | B | C
k3 -->  B | C

Now let q=k1; how do I make sure C doesn't appear as a result, since it
doesn't contain any occurrence of k1?


Re: Default query parser operator

2011-06-07 Thread Jonathan Rochkind

Nope, not possible.

I'm not even sure what it would mean semantically. If you had default 
operator OR ordinarily, but default operator AND just for field2, 
then what would happen if you entered:


field1:foo field2:bar field1:baz field2:bom

Where the heck would the ANDs and ORs go?  The operators are BETWEEN the 
clauses that specify fields; they don't belong to a field. In general, 
the operators are part of the query as a whole, not any specific field.


In fact, I'd be careful of your example query:
q=field1:foo bar field2:baz

I don't think that means what you think it means; I don't think the 
field1 applies to the bar in that case. I could be wrong, but you 
definitely want to check it.  You need field1:foo field1:bar, 
or set the default field for the query to field1, or use parens 
(although that will change the execution strategy and ranking): 
q=field1:(foo bar)


At any rate, even if there's a way to specify this so it makes sense, 
no, Solr/lucene doesn't support any such thing.
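
What you can do is change the default operator for the query as a whole, 
not per field -- either in schema.xml or per request. Roughly (these are 
from memory, so double-check the exact spelling for your Solr version):

<!-- schema.xml: applies to queries parsed by the standard query parser -->
<solrQueryParser defaultOperator="AND"/>

or, per request:

q=field1:word token field2:parser syntax&q.op=AND

Either way it applies to every clause in q, not just the field1 ones.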




On 6/7/2011 10:56 AM, Brian Lamb wrote:

I feel like this should be fairly easy to do, but I just don't see anywhere
in the documentation how to do this. Perhaps I am using the wrong search
parameters.

On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb
brian.l...@journalexperts.comwrote:


Hi all,

Is it possible to change the query parser operator for a specific field
without having to explicitly type it in the search field?

For example, I'd like to use:

http://localhost:8983/solr/search/?q=field1:word token field2:parser
syntax

instead of

http://localhost:8983/solr/search/?q=field1:word AND token field2:parser
syntax

But, I only want it to be applied to field1, not field2 and I want the
operator to always be AND unless the user explicitly types in OR.

Thanks,

Brian Lamb


