Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

Thanks Erick!

Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then a 
query for "mixedCase" will no longer also match "mixed Case".


I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for "mixedCase" 
to match both/either "mixed Case" or "mixedCase" in the index (with 
case insensitivity on top of that via another filter).


That would support names like "duBois", which are sometimes 
spelled "du bois" and sometimes "dubois", and allow the query "duBois" 
to match both in the index.


I had somehow thought that was what WDF was intended for. But it's 
actually not the usual functioning, and may not be realistic?


I'm a bit confused about what splitOnCaseChange combined with 
catenateWords is meant to do at all.  It _is_ generating both the split 
and single-word tokens at query time -- but not in a way that actually 
allows it to match both the split and single-word tokens?  What is 
supposed to be the purpose/use case for splitOnCaseChange with 
catenateWords? If any?


Jonathan

On 12/29/14 7:20 PM, Erick Erickson wrote:

Jonathan:

Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what's
significant in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

On 12/29/14 5:24 PM, Jack Krupansky wrote:


WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction



I do not understand what separate query/index analysis you are suggesting to
accomplish what I wanted.

I understand the WDF, like all software, is not magic, of course. But I
thought this was an intended use case of the WDF, with those settings:

A "mixedCase" query would match "mixedCase" in the index; and the same query 
"mixedCase" would also match the two separate words "mixed Case" in the index 
(case-insensitively, since I apply an ICUFoldingFilter on top of that).

Was I wrong, is this not an intended thing for the WDF to do? Or do I just
have the wrong configuration options for it to do it? Or is it a bug?

When I started this thread a few months ago, I think Erick Erickson agreed
this was an intended use case for the WDF, but maybe I explained it poorly.
Erick if you're around and want to at least confirm whether WDF is supposed
to do this in your understanding, that would be great!

Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind
I guess I don't understand what the four use cases are, or the three out 
of four use cases, or whatever -- what the intended uses of the WDF are.


Can you explain what the intended use of this setting is:

generateWordParts=1 catenateWords=1 splitOnCaseChange=1

Is that supposed to do something useful (at either query or index time), 
or is that a nonsensical configuration that nobody should ever use?
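
(For concreteness, my understanding -- which may be wrong -- is that at 
analysis time that setting turns a single token like mixedCase into:

    mixed      (position 1)
    Case       (position 2)
    mixedCase  (position 2, the catenated word, stacked on the last part)

i.e. the split parts and the catenated word all come out of the filter 
together.)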


I understand how analysis can be different at index vs query time. I 
think what I don't fully understand is what the possibilities and 
intended use case of the WDF are, with various configurations.


I thought one of the intended use cases, with appropriate configuration, 
was to do what I'm talking about: allow a "mixedCase" query to match both 
"mixed Case" and "mixedCase" in the index. I think you're saying I'm wrong, 
and this is not something WDF can do? Can you confirm I understand you 
right?


Thanks!

Jonathan

On 12/30/14 11:30 AM, Jack Krupansky wrote:

Right, that's what I meant by WDF not being magic - you can configure it
to match any three out of four use cases as you choose, but there is no
choice that matches all of the use cases.

To be clear, this is not a bug in WDF, but simply a limitation.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind rochk...@jhu.edu
wrote:


Thanks Erick!

Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
query for mixedCase will no longer also match mixed Case.

I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for mixedCase to
match both/either mixed Case or mixedCase in the index. (with case
insensitivity on top of that via another filter).

That would support things like names like duBois which are sometimes
spelled du bois and sometimes dubois, and allow the query duBois to
match both in the index.

I had somehow thought that was what WDF was intended for. But it's
actually not the usual functioning, and may not be realistic?

I'm a bit confused about what splitOnCaseChange combined with
catenateWords is meant to do at all.  It _is_ generating both the split and
single-word tokens at query time -- but not in a way that actually allows
it to match both the split and single-word tokens?  What is supposed to be
the purpose/use case for splitOnCaseChange with catenateWords? If any?

Jonathan


On 12/29/14 7:20 PM, Erick Erickson wrote:


Jonathan:

Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what's
significant in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu
wrote:


On 12/29/14 5:24 PM, Jack Krupansky wrote:



WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to
separate
the index and query analyzers and they need to respect that distinction




I do not understand what separate query/index analysis you are
suggesting to
accomplish what I wanted.

I understand the WDF, like all software, is not magic, of course. But I
thought this was an intended use case of the WDF, with those settings:

A mixedCase query would match mixedCase in the index; and the same
query
mixedCase would also match two separate words mixed Case in index.
(Case insensitively since I apply an ICUFoldingFilter on top of that).

Was I wrong, is this not an intended thing for the WDF to do? Or do I
just
have the wrong configuration options for it to do it? Or is it a bug?

When I started this thread a few months ago, I think Erick Erickson
agreed
this was an intended use case for the WDF, but maybe I explained it
poorly.
Erick if you're around and want to at least confirm whether WDF is
supposed
to do this in your understanding, that would be great!

Jonathan







Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

On 12/30/14 11:45 AM, Alexandre Rafalovitch wrote:

On 30 December 2014 at 11:12, Jonathan Rochkind rochk...@jhu.edu wrote:

I'm a bit confused about what splitOnCaseChange combined with catenateWords
is meant to do at all.  It _is_ generating both the split and single-word
tokens at query time


Have you tried only having WDF during indexing with both options set?
And same chain but without WDF at all during query?


Without WDF at all in the query, then "mixedCase" in a query would match 
"mixedCase" in the index, but would no longer match "mixed Case" in the index.


I thought I was using WDF in such a way that "mixedCase" in a query could 
match both/either "mixedCase" and/or "mixed Case" in the index. And I 
thought this was an intended use case of the WDF.


But perhaps I was wrong, and the WDF simply can't do this?  Is WDF 
intended mainly for use at index time and not query time? In general, 
I'm confused about the various things WDF can and can't do, and the 
various configurations to make it do that.


Thanks for everyone's advice.


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind
Okay, thanks. I'm not sure if it's my lack of understanding, but I feel 
like I'm having a very hard time getting straight answers out of you 
all, here.


I want the query "mixedCase" to match both/either "mixed Case" and 
"mixedCase" in the index.


What configuration of WDF at index/query time would do this?

This isn't necessarily the only thing I want WDF to do, but it's 
something I want it to do and thought it was doing and found out it 
wasn't. So we can isolate/simplify to there -- if I can figure out what 
WDF configuration (if any?) can do that first, then I can always move on 
to figuring out how/if that impacts the other things I want WDF to do.


So is there a WDF configuration that can do that? Or is the problem that 
it's confusing, and none of you are sure either whether there is one or 
what it would be?


Jonathan

On 12/30/14 12:02 PM, Jack Krupansky wrote:

I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You're not wrong about anything here... you just need to accept that WDF
is not magic and can't handle every use case that anybody can imagine.

And you do need to be careful about interactions between the query parser
and the analyzers, especially in these kinds of cases where a single term
might generate multiple terms.

Some of these features really are only suitable for advanced, expert
users.

Note that one of the features that Solr is missing is support for the
Google-like feature of splitting concatenated words (regardless of case.)
That's worthy of a Jira.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:44 AM, Jonathan Rochkind rochk...@jhu.edu
wrote:


I guess I don't understand what the four use cases are, or the three out
of four use cases, or whatever. What the intended uses of the WDF are.

Can you explain what the intended use of setting:

generateWordParts=1 catenateWords=1 splitOnCaseChange=1

Is that supposed to do something useful (at either query or index time),
or is that a nonsensical configuration that nobody should ever use?

I understand how analysis can be different at index vs query time. I think
what I don't fully understand is what the possibilities and intended use
case of the WDF are, with various configurations.

I thought one of the intended use cases, with appropriate configuration,
was to do what I'm talking about: allow a "mixedCase" query to match both
"mixed Case" and "mixedCase" in the index. I think you're saying I'm wrong, and
this is not something WDF can do? Can you confirm I understand you right?

Thanks!

Jonathan


On 12/30/14 11:30 AM, Jack Krupansky wrote:


Right, that's what I meant by WDF not being magic - you can configure it
to match any three out of four use cases as you choose, but there is no
choice that matches all of the use cases.

To be clear, this is not a bug in WDF, but simply a limitation.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind rochk...@jhu.edu
wrote:

  Thanks Erick!


Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
query for mixedCase will no longer also match mixed Case.

I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for mixedCase
to
match both/either mixed Case or mixedCase in the index. (with case
insensitivity on top of that via another filter).

That would support things like names like duBois which are sometimes
spelled du bois and sometimes dubois, and allow the query duBois to
match both in the index.

I had somehow thought that was what WDF was intended for. But it's
actually not the usual functioning, and may not be realistic?

I'm a bit confused about what splitOnCaseChange combined with
catenateWords is meant to do at all.  It _is_ generating both the split
and
single-word tokens at query time -- but not in a way that actually allows
it to match both the split and single-word tokens?  What is supposed to
be
the purpose/use case for splitOnCaseChange with catenateWords? If any?

Jonathan


On 12/29/14 7:20 PM, Erick Erickson wrote:

  Jonathan:


Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what's
significant in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu
wrote:

  On 12/29/14 5:24 PM, Jack Krupansky wrote:




WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to
separate
the index and query analyzers and they need to respect that
distinction




I do not understand what

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

On 12/30/14 12:35 PM, Walter Underwood wrote:

You want preserveOriginal="1".

You should only do this processing at index time.


If I only do this processing at index time, then "mixedCase" at query 
time will no longer match "mixed Case" in the index/source material.


I think I'm having trouble explaining. Let's say the source material 
being indexed included "mixed Case", not "mixedCase".  I want 
"mixedCase" in a query to still match it.


But if the source material that went into the index contained 
"mixedCase", I still want "mixedCase" in a query to match it as well.
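
To illustrate what I mean, here's my (possibly wrong) understanding of the 
terms that would end up in the index with index-time-only WDF (catenateWords 
and/or preserveOriginal on), no WDF at query time, and folding afterwards:

    indexed "mixedCase"   =>  terms: mixed, case, mixedcase   -- query term mixedcase matches
    indexed "mixed Case"  =>  terms: mixed, case              -- query term mixedcase does NOT match

So index-time-only processing covers the first case, but not the second one, 
which I care about just as much.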




Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jonathan Rochkind
Okay, some months later I've come back to this with an isolated 
reproduction case. Thanks very much for any advice or debugging help you 
can give.


The WordDelimiter filter is making a mixed-case query NOT match the 
single-case source, when it ought to.


I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no 
sense to debug here, and I need to install and try to reproduce on a 
more recent version).


I have an index that includes ONE document (deleted and reindexed after 
the index change), with content in only one field (text) other than 'id', 
and that content is one word: "delalain".


My analysis (both index and query, I don't have different ones) for the 
'text' field is simply:


<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>

    <filter class="solr.ICUFoldingFilterFactory" />
  </analyzer>
</fieldType>

I am querying simply with e.g. /select?defType=lucene&q=text%3Adelalain

Querying for "delalain" finds this document, as expected. Querying for 
"DELALAIN" finds this document, as expected (note the ICUFoldingFilterFactory).


However, querying for "deLALAIN" does not find this document, which is 
unexpected.


INDEX analysis of the source, "delalain", ends up as this in the index, 
which seems pretty straightforward, so I'll only bother pasting in the 
final index analysis:


##
text       delalain
raw_bytes  [64 65 6c 61 6c 61 69 6e]
position   1
start      0
end        8
type       ALPHANUM
script     Latin
###




QUERY analysis of the problematic query, "deLALAIN", looks like this:

#
ICUT   text       deLALAIN
       raw_bytes  [64 65 4c 41 4c 41 49 4e]
       start      0
       end        8
       type       ALPHANUM
       script     Latin
       position   1


WDF    text       de        LALAIN                deLALAIN
       raw_bytes  [64 65]   [4c 41 4c 41 49 4e]   [64 65 4c 41 4c 41 49 4e]
       start      0         2                     0
       end        2         8                     8
       type       ALPHANUM  ALPHANUM              ALPHANUM
       position   1         2                     2
       script     Common    Common                Common


ICUFF  text       de        lalain                delalain
       raw_bytes  [64 65]   [6c 61 6c 61 69 6e]   [64 65 6c 61 6c 61 69 6e]
       position   1         2                     2
       start      0         2                     0
       end        2         8                     8
       type       ALPHANUM  ALPHANUM              ALPHANUM
       script     Common    Common                Common
###



It's obviously the WordDelimiterFilter that is messing things up -- but 
how/why, and is it a bug?


It wants to search for both "de lalain" as a phrase, as well as 
alternately "delalain" as one word -- that's the intended supported 
point of the WDF with this configuration, right? And it should work?


The problem is that it is not successfully matching "delalain" as one word 
-- so, how do I figure out why not, and what to do about it?


Previously, Erick and Diego asked for the info from debug=query, so 
here is that as well:



<lst name="debug">
  <str name="rawquerystring">text:deLALAIN</str>
  <str name="querystring">text:deLALAIN</str>
  <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
  <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
  <str name="QParser">LuceneQParser</str>
</lst>


Hmm, that does not seem quite right. If I interpret it correctly, it's 
looking for "de" followed by either "lalain" or "delalain" -- i.e., it 
would match "de delalain"? But that's not right at all.
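
If I'm reading the MultiPhraseQuery semantics right (and I may not be), it 
wants the term "de" at some position p, followed by either "lalain" or 
"delalain" at position p+1 -- presumably because WDF stacks the catenated 
token on the position of the last sub-token. My indexed document only has 
the single token "delalain" at one position, with no "de" token in front of 
it, so that phrase query can never match it:

    index:  delalain(1)
    query:  de(1)  then  (lalain | delalain)(2)

Which would at least explain the behavior I'm seeing, even if it isn't the 
behavior I want.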


So, what's gone wrong? Something with WDF configured with 
generateWordParts/catenateWords/splitOnCaseChange? Is it a bug? (And if it's 
a bug, one that might be fixed in a more recent Solr?)


Thanks!

Jonathan




On 9/3/14 7:15 PM, Erick Erickson wrote:

Jonathan:

If at all possible, delete your collection/data directory (the whole
directory, including data) between runs after you've changed
your schema (at least any of your analysis that pertains to indexing).
Mixing old and new schema definitions can add to the confusion!

Good luck!
Erick

On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

Thanks Erick and Diego. Yes, I noticed in my last message I'm not actually
using defaults, not sure why I chose non-defaults originally.

I still need to find time to make a smaller isolation/reproduction case, I'm
getting confusing results that suggest some other part of my field def may
be pertinent.

I'll come back when I've done that (hopefully next week), and include the
_parsed_ from debug=query then. Thanks!

Jonathan



On 9/2/14 4:26 PM, Erick Erickson wrote:


What happens if you append

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jonathan Rochkind

On 12/29/14 5:24 PM, Jack Krupansky wrote:

WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction


I do not understand what separate query/index analysis you are 
suggesting to accomplish what I wanted.


I understand the WDF, like all software, is not magic, of course. But I 
thought this was an intended use case of the WDF, with those settings:


A mixedCase query would match mixedCase in the index; and the same 
query mixedCase would also match two separate words mixed Case in 
index.  (Case insensitively since I apply an ICUFoldingFilter on top of 
that).


Was I wrong, is this not an intended thing for the WDF to do? Or do I 
just have the wrong configuration options for it to do it? Or is it a bug?


When I started this thread a few months ago, I think Erick Erickson 
agreed this was an intended use case for the WDF, but maybe I explained 
it poorly. Erick if you're around and want to at least confirm whether 
WDF is supposed to do this in your understanding, that would be great!


Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-03 Thread Jonathan Rochkind
Thanks Erick and Diego. Yes, I noticed in my last message I'm not 
actually using defaults, not sure why I chose non-defaults originally.


I still need to find time to make a smaller isolation/reproduction case, 
I'm getting confusing results that suggest some other part of my field 
def may be pertinent.


I'll come back when I've done that (hopefully next week), and include 
the _parsed_ from debug=query then. Thanks!


Jonathan


On 9/2/14 4:26 PM, Erick Erickson wrote:

What happens if you append debug=query to your query? IOW, what does the
_parsed_ query look like?

Also note that the defaults for WDFF are _not_ identical. catenateWords and
catenateNumbers are 1 in the
index portion and 0 in the query section. Still, this shouldn't be a
problem all other things being equal.

Best,
Erick


On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind rochk...@jhu.edu wrote:


On 9/2/14 1:51 PM, Erick Erickson wrote:


bq: In my actual index, query MacBook is matching ONLY mac book, and
not macbook

I suspect your query parameters for WordDelimiterFilterFactory doesn't
have
catenate words set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?



Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both index
and query phases (is that right?):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>

It's hard to cut and paste the results of the analysis page into email (or
anywhere!), I'll give you screenshots, sorry -- and I'll give them for our
whole real world app complex field definition. I'll also paste in our
entire field definition below. But I realize my next step is probably
creating a simpler isolation/reproduction case (unless you have a magic
answer from this!).

Again, the problem is that "MacBook" seems to be only matching on indexed
"macbook" and not indexed "mac book".


MacBook query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

MacBook index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

mac book index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation,
         so our synonym filter involving C++ etc. can still work.
         From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
         the rbbi file is in our local ./conf, copied from the lucene source tree -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

    <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt"
            ignoreCase="true"/>

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

    <!-- folding needs to be after WordDelimiter, so WordDelimiter
         can do its thing with full cases and such -->
    <filter class="solr.ICUFoldingFilterFactory" />

    <!-- ICUFolding already includes lowercasing, no
         need for a separate lowercasing step
    <filter class="solr.LowerCaseFilterFactory"/>
    -->

    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>









WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind
Hello, I'm running into a case where a query is not returning the 
results I expect, and I'm hoping someone can offer some explanation that 
might help me fine tune things or understand what's up.


I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later a filter that 
downcases everything for case-insensitive searching. It includes many 
other things too, but I think these are the pertinent facts.


For query dELALAIN, the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with d and 
ELALAIN split into two tokens, and another with just one token.


Later, all the tokens are lowercased by another filter in the chain. 
(actually an ICU filter which is doing something more complicated than 
just lowercasing, but I think we can consider it lowercasing for the 
purposes of this discussion).


If I understand right what the WordDelimiterFilter is trying to do here, 
it's probably doing something special because of the lowercase d 
followed by an uppercase letter, a special case for that. (I don't get 
this behavior with other mixed case queries not beginning with 'd').


And, what I think it's trying to do, is match text indexed as "d 
elalain" as well as text indexed as "delalain".


The problem is, it's not accomplishing that -- it is NOT matching text 
that was indexed as delalain (one token).


I don't entirely understand what the position attribute is for -- but 
I wonder if in this case, the position on dELALAIN is really supposed 
to be 1, not 2?  Could that be responsible for the bug?  Or is position 
irrelevant in this case?


If that's not it, then I'm at a loss as to what may be causing this bug 
-- or even if it's a bug at all, or I'm just not understanding intended 
behavior. I expect a query for dELALAIN to match text indexed as 
delalain (because of the forced lowercasing in the filter chain). But 
it's not doing so. Are my expectations wrong? Bug? Something else?


Thanks for any advice,

Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind

Thanks for the response.

I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as 
they are long and complicated. I can narrow it down to an isolation case 
if I need to. My indexed field in question is relatively short strings.


But what it has to do with is the WordDelimiterFilter's default 
splitOnCaseChange=1 and generateWordParts=1 settings, and their effects.


Let's take a less confusing example: the query "MacBook", with a 
WordDelimiterFilter followed by something that downcases everything.


I think what the WDF (followed by case folding) is trying to do is make 
the query "MacBook" match both indexed text "mac book" as well as "macbook" 
-- either one should be a match. Is my understanding right of what 
WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is 
intending to do?


In my actual index, the query "MacBook" is matching ONLY "mac book", and not 
"macbook", which is unexpected. I indeed want it to match both. (I 
realize I could make it match only 'macbook' by setting 
splitOnCaseChange=0 and/or generateWordParts=0.)


It's possible this is happening as a side effect of other parts of my 
complex field definition, and I really do need to post the whole thing 
and/or isolate it. But I wonder if there are known general problem cases 
that cause this kind of failure, or any known bugs in 
WordDelimiterFilter (in Solr 4.3?) that cause this kind of failure.


And I wonder if the WordDelimiter filter spitting out the token "MacBook" 
with position 2 rather than 1 is expected, irrelevant, or possibly a 
relevant problem.


Thanks again,

Jonathan

On 9/2/14 12:59 PM, Michael Della Bitta wrote:

Hi Jonathan,

Little confused by this line:


And, what I think it's trying to do, is match text indexed as d elalain

as well as text indexed by delalain.

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote:


Hello, I'm running into a case where a query is not returning the results
I expect, and I'm hoping someone can offer some explanation that might help
me fine tune things or understand what's up.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later a filter that
downcases everything for case-insensitive searching. It includes many other
things too, but I think these are the pertinent facts.

For query dELALAIN, the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with d and
ELALAIN split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain.
(actually an ICU filter which is doing something more complicated than just
lowercasing, but I think we can consider it lowercasing for the purposes of
this discussion).

If I understand right what the WordDelimiterFilter is trying to do here,
it's probably doing something special because of the lowercase d followed
by an uppercase letter, a special case for that. (I don't get this behavior
with other mixed case queries not beginning with 'd').

And, what I think it's trying to do, is match text indexed as d elalain
as well as text indexed by delalain.

The problem is, it's not accomplishing that -- it is NOT matching text
that was indexed as delalain (one token).

I don't entirely understand what the position attribute is for -- but I
wonder if in this case, the position on dELALAIN is really supposed to be
1, not 2?  Could that be responsible for the bug?  Or is position
irrelevant in this case?

If that's not it, then I'm at a loss as to what may be causing this bug --
or even if it's a bug at all, or I'm just not understanding intended
behavior. I expect a query for dELALAIN to match text indexed as
delalain (because of the forced lowercasing in the filter chain). But
it's not doing so. Are my expectations wrong? Bug? Something else?

Thanks for any advice,

Jonathan





Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind
Yes, thanks, I realize I can twiddle those parameters, but it will 
probably result in "MacBook" no longer matching "mac book" at all, but 
ONLY matching "macbook".


My understanding of the default settings of WordDelimiterFilterFactory is that 
they are intended for "MacBook" to match both "mac book" AND "macbook".


I will try to create an isolation reproduction that demonstrates this 
ruling out interference from other filters (or identifying the other 
filters), to make my question more clear, I guess.


Jonathan

On 9/2/14 1:34 PM, Michael Della Bitta wrote:

If that's your problem, I bet all you have to do is twiddle on one of the
catenate options, either catenateWords or catenateAll.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote:


Thanks for the response.

I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as
they are long and complicated. I can narrow it down to an isolation case if
I need to. My indexed field in question is relatively short strings.

But what it's got to do with is the WordDelimiterFilter's default
splitOnCaseChange=1 and generateWordParts=1, and the effects of such.

Let's take a less confusing example, query MacBook. With a
WordDelimiterFilter followed by something that downcases everything.

I think what the WDF (followed by case folding) is trying to do is make
query MacBook match both indexed text mac book as well as macbook --
either one should be a match. Is my understanding right of what
WordDelimiterfilter with splitOnCaseChange=1 and generateWordParts=1 is
intending to do?

In my actual index, query MacBook is matching ONLY mac book, and not
macbook.  Which is unexpected. I indeed want it to match both. (I realize
I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or
generateWordParts=0).

It's possible this is happening as a side effect of other parts of my
complex field definition, and I really do need to post hte whole thing
and/or isolate it. But I wonder if there are known general problem cases
that cause this kind of failure, or any known bugs in WordDelimiterFilter
(in Solr 4.3?) that cause this kind of failure.

And I wonder if WordDelimiter filter spitting out the token MacBook with
position 2 rather than 1 is expected, irrelevant, or possibly a
relevant problem.

Thanks again,

Jonathan


On 9/2/14 12:59 PM, Michael Della Bitta wrote:


Hi Jonathan,

Little confused by this line:

  And, what I think it's trying to do, is match text indexed as d elalain



as well as text indexed by delalain.

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/
112002776285509593336/posts
w: appinions.com http://www.appinions.com/



On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind rochk...@jhu.edu
wrote:

  Hello, I'm running into a case where a query is not returning the results

I expect, and I'm hoping someone can offer some explanation that might
help
me fine tune things or understand what's up.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later a filter that
downcases everything for case-insensitive searching. It includes many
other
things too, but I think these are the pertinent facts.

For query dELALAIN, the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with d and
ELALAIN split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain.
(actually an ICU filter which is doing something more complicated than
just
lowercasing, but I think we can consider it lowercasing for the purposes
of
this discussion).

If I understand right what the WordDelimiterFilter is trying to do here,
it's probably doing something special because of the lowercase d
followed
by an uppercase letter, a special case for that. (I don't get this
behavior
with other mixed case queries not beginning with 'd').

And, what I think it's

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind

On 9/2/14 1:51 PM, Erick Erickson wrote:

bq: In my actual index, query MacBook is matching ONLY mac book, and
not macbook

I suspect your query parameters for WordDelimiterFilterFactory doesn't have
catenate words set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?


Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both 
index and query phases (is that right?):


<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>


It's hard to cut and paste the results of the analysis page into email 
(or anywhere!), I'll give you screenshots, sorry -- and I'll give them 
for our whole real world app complex field definition. I'll also paste 
in our entire field definition below. But I realize my next step is 
probably creating a simpler isolation/reproduction case (unless you have 
a magic answer from this!).


Again, the problem is that "MacBook" seems to be only matching on 
indexed "macbook" and not indexed "mac book".



MacBook query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

MacBook index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

mac book index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation,
         so our synonym filter involving C++ etc. can still work.
         From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
         the rbbi file is in our local ./conf, copied from the lucene source tree -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

    <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt"
            ignoreCase="true"/>

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

    <!-- folding needs to be after WordDelimiter, so WordDelimiter
         can do its thing with full cases and such -->
    <filter class="solr.ICUFoldingFilterFactory" />

    <!-- ICUFolding already includes lowercasing, no
         need for a separate lowercasing step
    <filter class="solr.LowerCaseFilterFactory"/>
    -->

    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>






Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Jonathan Rochkind

On 12/17/13 1:16 PM, Chris Hostetter wrote:

As i mentioned in the blog above, as long as you have a uniqueKey field
that supports range queries, bulk exporting of all documents is fairly
trivial by sorting on your uniqueKey field and using an fq that also
filters on your uniqueKey field modify the fq each time to change the
lower bound to match the highest ID you got on the previous page.


Aha, very nice suggestion, I hadn't thought of this, when myself trying 
to figure out decent ways to 'fetch all documents matching a query' for 
some bulk offline processing.
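
For my own notes, the pattern as I understand it would look something like 
this (field names and page size are just examples):

    # first page
    /select?q=*:*&fq=my_query&sort=id+asc&rows=1000&fl=id

    # each later page: exclusive lower bound on the highest id from the previous page
    /select?q=*:*&fq=my_query&fq=id:{LAST_ID TO *]&sort=id+asc&rows=1000&fl=id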


One question that I was never sure about when trying to do things like 
this -- is this going to end up blowing the query and/or document caches 
if used on a live Solr?  By filling up those caches with the results of 
the 'bulk' export?  If so, is there any way to avoid that? Or does it 
probably not really matter?


Jonathan


Re: json update moves doc to end

2013-12-03 Thread Jonathan Rochkind

What order, the order if you supply no explicit sort at all?

Solr does not make any guarantees about what order documents will come 
back in if you do not ask for a sort.


In general in Solr/lucene, the only way to update a document is to 
re-add it as a new document, so that's probably what's going on behind 
the scenes, and it probably affects the 'default' sort order -- which 
Solr makes no guarantees about anyway; you probably shouldn't even count 
on it being consistent at all.


If you want a consistent sort order, maybe add a field with a timestamp, 
and ask for results sorted by the timestamp field? And then make sure 
not to change the timestamp when you do an update that you don't want to 
change the order?
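
For example (just a sketch; the field name is made up and assumes a date 
fieldType is defined in your schema):

    <field name="created_at" type="date" indexed="true" stored="true"/>

Populate it once when the document is first created, send the same value 
again on every update, and sort with sort=created_at asc.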


Apologies if I've misunderstood the situation.

On 12/3/13 1:00 PM, Andreas Owen wrote:

When I search for "agenda" I get a lot of hits. Now if I update the 2nd
result by json-update, the doc is moved to the end of the index when I search
for it again. The field I change is "editorschoice" and it never contains
the search term "agenda", so I don't see why it changes the order. Why does
it?



Part of Solrconfig requesthandler I use:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200
        title^20 h_*^14
        tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
        contentmanager^5 links^5
        last_modified^5 url^5
    </str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str> <!-- tested: now or newer or empty gets small boost -->
    <str name="bf">log(clicks)^8</str> <!-- tested -->
    <!-- todo: number of links (count urlparse in links query) /
         frequency of the search term (bf = count in title and text) -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>

    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>

    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp}inhaltstyp</str>
    <str name="f.inhaltstyp.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung}veranstaltung</str>
    <str name="f.veranstaltung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>




Re: Need idea to standardize keywords - ring tone vs ringtone

2013-10-28 Thread Jonathan Rochkind
Do you know about the Solr synonym feature?  That seems more applicable 
to what you're describing then stopwords. I'd stay away from stopwords 
entirely here, and try to do what you want with synonyms.


Multi-word synonyms can be tricky, I'm not entirely sure the right way 
to do it for this use case. But I think the synonym feature is what you 
want. Not the stopwords feature.
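
For example, something along these lines in synonyms.txt, applied at index 
time, is roughly what I have in mind -- untested for your case, and the 
multi-word left-hand side is exactly the tricky part:

    ring tone, ringer tone => ringtone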




On 10/28/13 12:24 PM, Developer wrote:

Thanks for your response Eric. Sorry for the confusion.

I currently display both 'ring tone' as well as 'ringtone' when the user
types in 'r' but I am trying to figure out a way to display just 'ringtone'
hence I added 'ring tone' to stopwords list so that it doesn't get indexed.

I have the list of know keywords (more like synonyms) which I am trying to
map against the user entered keywords.

ring tone, ringer tone = ringtone








Re: difference between apache tomcat vs Jetty

2013-10-24 Thread Jonathan Rochkind
This is good to know, and I find it welcome advice; I would recommend 
making sure this advice is clearly highlighted in the relevant Solr 
docs, such as any getting started docs.


I'm not sure everyone realizes this, and some go down the tomcat route 
without realizing the Solr committers recommend jetty -- or use a stock 
jetty without realizing the 'example' jetty is recommended and actually 
intended to be used by Solr users in production!  I think it's easy to 
miss this advice.


On 10/20/13 5:55 PM, Shawn Heisey wrote:

On 10/20/2013 2:57 PM, Shawn Heisey wrote:

We recommend jetty.  The solr example uses jetty.


I have a clarification for this statement.  We actually recommend using
the jetty that's included in the Solr 4.x example.  It is stripped of
all unnecessary features and its config has had some minor tuning so
it's optimized for Solr.  The jetty binaries in 4.x are completely
unmodified from the upstream download, we just don't include all of
them.  On the 1.x and 3.x examples, there was a small bug in Jetty 6, so
those versions included modified binaries.

If you download jetty from eclipse.org or install it from your operating
system's repository, it will include components you don't need and its
config won't be optimized for Solr, but it will still be a lot closer to
what's actually tested than tomcat is.

Thanks,
Shawn



solr 4.3, autocommit, maxdocs

2013-07-15 Thread Jonathan Rochkind
I have a solr 4.3 instance I am in the process of standing up. It 
started out with an empty index.


I have in its solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

I have an index process running, that has currently added around 400k 
documents to Solr.


I had expected that a 'commit' would be run every 100k documents, from 
the above configuration, so 4 commits would have been run by now, and 
I'd see documents in the index.


However, when I look in the Solr admin interface, at my core's 
'overview' page, it still says num docs 0, segment count 0, when I 
expected num docs 400k at this point.


Is there something I'm misunderstanding about the configuration or the 
admin interface? Or am I right in my expectations, but something else 
must be going wrong?


Thanks for any advice,

Jonathan


Re: solr 4.3, autocommit, maxdocs

2013-07-15 Thread Jonathan Rochkind
Ah, thanks for this explanation. Although I don't entirely understand 
it, I am glad there is an expected explanation!


This Solr instance is actually set up to be a replication master. It 
never gets searched itself, it just replicates to slaves that get searched.


Perhaps some time in the past (I am migrating from an already set up 
Solr 1.4 instance), I set this value to false, figuring it was not 
necessary to actually open a searcher, since the master does not get 
searched itself ordinarily.


Despite the opensearcher=false... once committed, are the committed docs 
still going to be sent via replication to a slave, is the index used for 
replication actually changed, even though a searcher hasn't been opened 
to take account of it?  Or will the opensearcher=false keep the commits 
from being seen by replication slaves too?


Thanks for any tips,

Jonathan

On 7/15/13 12:57 PM, Jason Hellman wrote:

Jonathan,

Please note the openSearcher=false part of your configuration.  This is why you 
don't see documents.  The commits are occurring, and being written to segments 
on disk, but they are not visible to the search engine because a Solr searcher 
class has not opened them for visibility.

You can either change the value to true, or alternatively call a deterministic 
commit call at the end of your load (a solr/update?commit=true will default to 
openSearcher=true).

Hope that's of use!

Jason


On Jul 15, 2013, at 9:52 AM, Jonathan Rochkind rochk...@jhu.edu wrote:


I have a solr 4.3 instance I am in the process of standing up. It started out 
with an empty index.

I have in its solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

I have an index process running, that has currently added around 400k documents 
to Solr.

I had expected that a 'commit' would be run every 100k documents, from the 
above configuration, so 4 commits would have been run by now, and I'd see 
documents in the index.

However, when I look in the Solr admin interface, at my core's 'overview' page, 
it still says num docs 0, segment count 0.  When I expected num docs 400k at 
this point.

Is there something I'm misunderstanding about the configuration or the admin 
interface? Or am I right in my expectations, but something else must be going 
wrong?

Thanks for any advice,

Jonathan




SolrJ and initializing logger in solr 4.3?

2013-07-11 Thread Jonathan Rochkind

I am using SolrJ in a Java (actually jruby) project, with Solr 4.3.

When I instantiate an HttpSolrServer, I get the dreaded:

log4j:WARN No appenders could be found for logger 
(org.apache.solr.client.solrj.impl.HttpClientUtil).

log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for 
more info.



Using SolrJ as an embedded library in my own software, what is the 
proper or 'best practice' way -- or failing that, just any way at all -- 
to initialize log4j under Solr 4.3?


I am not super familiar with Java or log4j; hopefully there is an easy 
way to do this?
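
From what I've read, the usual quick fix is a minimal log4j.properties 
somewhere on the classpath, something like the following -- though whether 
that counts as 'best practice' is exactly what I'm asking about:

    log4j.rootLogger=WARN, stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %c - %m%n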


(If someone has a way especially suited for jruby, even better; but just 
a standard Java answer would be great too.)


Thanks for any advice!


SolrJ 4.3 to Solr 1.4

2013-07-11 Thread Jonathan Rochkind
So, trying to use a SolrJ 4.3 to talk to an old Solr 1.4. Specifically 
to add documents.


The wiki at http://wiki.apache.org/solr/Solrj suggests, I think, that 
this should work, so long as you:


server.setParser(new XMLResponseParser());

However, when I do this, I still get a 
org.apache.solr.common.SolrException: parsing error from 
org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:143)


(If I _don't_ setParser to XML, and use the binary parser... I get a 
fully expected error about binary format corruption -- that part is 
expected and I understand it, that's why you have to use the 
XMLResponseParser instead).


Am I not doing enough to my SolrJ 4.3 to get it to talk to the Solr 1.4 
server in pure XML? I've set the parser to the XMLResponseParser, do I 
also have to somehow tell it to actually use the Solr 1.4 XML update 
handler or something?  I don't entirely understand what I'm talking about.


Alternately... is it just a lost cause trying to get SolrJ 4.3 to talk 
to Solr 1.4, is the wiki wrong that this is possible?


Thanks for any help,

Jonathan


Re: SolrJ 4.3 to Solr 1.4

2013-07-11 Thread Jonathan Rochkind

Huh, that might have been a false problem of some kind.

At the moment, it looks like I _do_ have my SolrJ 4.3 succesfully 
talking to a Solr 1.4, so long as I setParser(new XMLResponseParser()).


Not sure what I changed or what wasn't working before, but great!

So nevermind. Although if anyone reading this wants to share any other 
potential gotchas on solrj 4.3 talking to solr 1.4, feel free!
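
For reference, the minimal setup that appears to be working for me looks 
roughly like this (URL and field values are just placeholders):

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");
    server.setParser(new XMLResponseParser());  // XML responses; the 1.4 javabin format isn't compatible

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "test-1");
    server.add(doc);
    server.commit();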


On 7/11/13 4:24 PM, Jonathan Rochkind wrote:

So, trying to use a SolrJ 4.3 to talk to an old Solr 1.4. Specifically
to add documents.

The wiki at http://wiki.apache.org/solr/Solrj suggests, I think, that
this should work, so long as you:

server.setParser(new XMLResponseParser());

However, when I do this, I still get a
org.apache.solr.common.SolrException: parsing error from
org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:143)


(If I _don't_ setParser to XML, and use the binary parser... I get a
fully expected error about binary format corruption -- that part is
expected and I understand it, that's why you have to use the
XMLResponseParser instead).

Am I not doing enough to my SolrJ 4.3 to get it to talk to the Solr 1.4
server in pure XML? I've set the parser to the XMLResponseParser, do I
also have to somehow tell it to actually use the Solr 1.4 XML update
handler or something?  I don't entirely understand what I'm talking about.

Alternately... is it just a lost cause trying to get SolrJ 4.3 to talk
to Solr 1.4, is the wiki wrong that this is possible?

Thanks for any help,

Jonathan


Solr, ICUTokenizer with Latin-break-only-on-whitespace

2013-06-20 Thread Jonathan Rochkind

(to solr-user, CC'ing author I'm responding to)

I found the solr-user listserv contribution at:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E

Which explain a way you can supply custom rulefiles to ICUTokenizer, in 
this case to tell it to only break on whitespace for Latin character 
substrings.


I am trying to use the technique explained there in Solr 4.3, but either 
it's not working, or it's not doing what I'd expect.


I want, for instance, "C++ Language" to be tokenized into "C++", 
"Language". I am passing rulefiles=Latn:Latin-break-only-on-whitespace.rbbi, 
with the rbbi file from the Solr 4.3 source [1].


But the ICUTokenizer, even with that rulefile, is still stripping 
the punctuation, and tokenizing that into "C", "Language".


Can anyone give me any guidance or hints? I don't entirely understand 
the semantics of the rbbi file to try debugging there. Is something not 
working, or does the rbbi file just not express the semantics I want?
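
For reference, the way I'm invoking it (with the rbbi file sitting in the 
core's conf directory) is:

    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>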


Thanks for any tips.



[1] 
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_3_0/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi?revision=1479557view=markup




Re: Solr, ICUTokenizer with Latin-break-only-on-whitespace

2013-06-20 Thread Jonathan Rochkind
Thank you... I started out writing an email with screenshots proving 
that it wasn't working for me in 4.3.0... and of course, having to 
confirm every single detail in order to say I confirmed it... I realized 
it was a mistake on my part, not testing what I thought I was testing.


Does indeed appear to be working now. Thanks! And thanks for this feature.


On 6/20/2013 3:40 PM, Shawn Heisey wrote:

On 6/20/2013 1:26 PM, Jonathan Rochkind wrote:

I want, for instance, "C++ Language" to be tokenized into "C++",
"Language". I am passing rulefiles=Latn:Latin-break-only-on-whitespace.rbbi,
with the rbbi file from the Solr 4.3 source [1].

But the ICUTokenizer, even with that rulefile, is still stripping
the punctuation, and tokenizing that into "C", "Language".


This screenshot is using branch_4x downloaded and compiled a couple of
hours ago, with the rbbi file you mentioned copied to the conf directory:

https://dl.dropboxusercontent.com/u/97770508/icutokenizer-whitespace-only.png


It shows that the ++ is maintained by the ICU tokenizer. It also
illustrates a UI bug that I will have to show to steffkes where the ++
is lost from the input field after analysis.

Thanks,
Shawn



Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind

I am trying to get Solr installed in Tomcat, and having trouble.

I am trying to use the instructions at 
http://wiki.apache.org/solr/SolrTomcat as a guide, trying to start with 
the example Solr from the Solr distro. I tried with both a binary 
distro's existing solr.war, and with compiling my own solr.war.


* Solr 4.3.0
* Tomcat 6.0.29
* JVM 1.6

When I start up tomcat, I get in the Tomcat log:


INFO: Deploying web application archive solr.war
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors


And solr is not actually deployed, naturally.

I've tried to google for advice on this -- mostly what I found was 
suggestions for how to turn up logging to get more info (maybe a stack 
trace?) to give you more clues about what's failing -- but nothing I 
found actually succeeded in turning up the logging.


So I'm at a bit of a loss. Any suggestions? Any ideas what might be 
causing this error, and/or how to get more information on what's causing it?


Re: Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind
Thanks! I guess I should have asked on-list BEFORE wasting 4 hours 
fighting with it myself, but I was trying to be a good user and do my 
homework!  Oh well.


Off to the logging instructions, hope I can figure them out -- if you 
could update the tomcat instructions with the simplest possible way to 
get deployment in Tomcat to work, that'd def be helpful!


On 5/30/2013 10:41 AM, Shawn Heisey wrote:

I am trying to get Solr installed in Tomcat, and having trouble.




When I start up tomcat, I get in the Tomcat log:


INFO: Deploying web application archive solr.war
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors




I've tried to google for advice on this -- mostly what I found was
suggestions for how to turn up logging to get more info


In a cruel twist of fate, it is actually logging changes that are
preventing Solr from starting. The required steps for deploying 4.3
changed. I will update the wiki page about tomcat when I'm not on a train.
  See this page for additional instructions, specifically the section about
deploying on containers other than jetty:

http://wiki.apache.org/solr/SolrLogging

Thanks,
Shawn





Re: Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind
I'm going to add a note to http://wiki.apache.org/solr/SolrLogging , 
with the Tomcat sample Error filterStart error, as an example of 
something you might see if you have not set up logging.


Then at least in the future, googling solr tomcat error filterStart 
might lead someone to the clue that it might be logging.



On 5/30/2013 10:41 AM, Shawn Heisey wrote:

I am trying to get Solr installed in Tomcat, and having trouble.




When I start up tomcat, I get in the Tomcat log:


INFO: Deploying web application archive solr.war
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors




I've tried to google for advice on this -- mostly what I found was
suggestions for how to turn up logging to get more info


In a cruel twist of fate, it is actually logging changes that are
preventing Solr from starting. The required steps for deploying 4.3
changed. I will update the wiki page about tomcat when I'm not on a train.
  See this page for additional instructions, specifically the section about
deploying on containers other than jetty:

http://wiki.apache.org/solr/SolrLogging

Thanks,
Shawn





Re: Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind

Okay, sadly, i still can't get this to work.

Following the instructions at:
https://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty

I copied solr/example/lib/ext/*.jar into my tomcat's ./lib, and copied 
solr/example/resources/log4j.properties there too.


The result is unchanged, when I start tomcat, it still says:

May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors


This is very frustrating. I have no way to even be sure this problem 
really is logging related, although it seems likely. But I feel like I'm 
just randomly moving chairs around and hoping the error will go away, 
and it does not.


Is there anyone that has successfully run Solr 4.3.0 in a Tomcat 6? Can 
we even confirm this is possible?  Can anyone give me any other hints -- 
especially, does anyone have any idea how to get some more logging out of 
Tomcat than the fairly useless "Error filterStart"?


The only reason I'm using tomcat is that we always have in our current 
Solr 1.4-based application, for reasons lost to time. I was hoping to 
upgrade to Solr 4.3, without simultaneously switching our infrastructure 
from tomcat to jetty, change one thing at a time. I suppose I might need 
to abandon that and switch to jetty too, but I'd rather not.


Re: Solr 4.3, Tomcat, Error filterStart

2013-05-30 Thread Jonathan Rochkind
Okay, for posterity: I did manage to get it working. It WAS lack of the 
logging files.


First, the only way I could manage to get Tomcat6 to log an actual 
stacktrace for the Error filterStart was to _delete_ my 
CATALINA_HOME/conf/logging.properties file.  Apparently without this 
file at all, the default ends up being 'log everything'.


And once that happened, it did confirm that the Error filterStart 
problem WAS an inability to find the logging jars. (And the stack trace 
was an exception from Solr with a nice message including the URL to the 
logging wiki page, nice one solr). Nothing I tried before deleting that 
file entirely, in a fit of desperation, worked to get the stack trace 
logged.


Once confirmed that the problem really was not finding the logging jars, 
I could keep doing things and restarting and seeing if that was still 
the exception.


And I found that for some reason, despite 
http://tomcat.apache.org/tomcat-6.0-doc/class-loader-howto.html 
suggesting that jars could be found in either CATALINA_BASE/lib (for me 
/opt/tomcat6/lib) OR CATALINA_HOME/lib (for me /usr/share/tomcat6/lib), 
in fact for whatever reason /opt/tomcat6/lib was being ignored, but 
/usr/share/tomcat6/lib worked.


And now I successfully have solr started in tomcat.

I realize that these are all tomcat6 issues, not solr issues. But others 
trying to get solr started may have similar problems. Appreciate the tip 
that the Error filterStart was probably related to new solr 4.3.0 
logging setup, which ended up confirmed.


Jonathan

On 5/30/2013 3:19 PM, Jonathan Rochkind wrote:

Okay, sadly, i still can't get this to work.

Following the instructions at:
https://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty


I copied solr/example/lib/ext/*.jar into my tomcat's ./lib, and copied
solr/example/resources/log4j.properties there too.

The result is unchanged, when I start tomcat, it still says:

May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors


This is very frustrating. I have no way to even be sure this problem
really is logging related, although it seems likely. But I feel like I'm
just randomly moving chairs around and hoping the error will go away,
and it does not.

Is there anyone that has successfully run Solr 4.3.0 in a Tomcat 6? Can
we even confirm this is possible?  Can anyone give me any other hints --
especially, does anyone have any idea how to get some more logging out of
Tomcat than the fairly useless "Error filterStart"?

The only reason I'm using tomcat is that we always have in our current
Solr 1.4-based application, for reasons lost to time. I was hoping to
upgrade to Solr 4.3, without simultaneously switching our infrastructure
from tomcat to jetty, change one thing at a time. I suppose I might need
to abandon that and switch to jetty too, but I'd rather not.


replication without automated polling, just manual trigger?

2013-05-15 Thread Jonathan Rochkind
I want to set up Solr replication between a master and slave, where no 
automatic polling every X minutes happens, instead the slave only 
replicates on command. [1]


So the basic question is: What's the best way to do that? But I'll 
provide what I've been doing etc., for anyone interested.


Until recently, my appliation was running on Solr 1.4.  I had a setup 
that was working to accomplish this in Solr 1.4, but as I work on moving 
it to Solr 4.3, it's unclear to me if it can/will work the same way.


In Solr 1.4, on the slave, I supplied a masterUrl, but did NOT supply any 
pollInterval at all on the slave.  I did NOT set the slave's "enable" 
option to false, because I think that would have prevented even manual 
replication.


This seemed to result in the slave never polling, although I'm not sure 
if that was just an accident of Solr implementation or not.  Can anyone 
say if the same thing would happen in Solr 4.3?  If I look at the admin 
screen for my slave set up this way in Solr 4.3, it does say "polling 
enabled", but I realize that doesn't necessarily mean any polling will 
take place, since I've set no pollInterval.


In Solr 1.4 under this setup, I could go to the slave's 
admin/replication, and there was a replicate now button that I could 
use for manually triggered replication.  This button seems to no longer 
be there in 4.3 replication admin screen, although I suppose I could 
still, somewhat less conveniently, issue a 
`replication?command=fetchindex` to the slave, to manually trigger a 
replication?
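
(Concretely, the sort of slave setup and manual trigger I mean -- hostnames
made up -- is roughly:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master.example.org:8983/solr/replication</str>
    <!-- no pollInterval supplied, hoping that means no automatic polling -->
  </lst>
</requestHandler>

and then, when I actually want a replication:

http://slave.example.org:8983/solr/replication?command=fetchindex
)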




Thanks for any advice or ideas.



[1]: Why, you ask?  The master is actually my 'indexing' server. Due to 
business needs, indexing only happens in bulk/mass indexing, and only 
happens periodically -- sometimes nightly, sometimes less. So I index on 
master, at a periodic schedule, and then when indexing is complete and 
verified, tell slave to replicate.  I don't want slave accidentally 
replicating in the middle of the bulk indexing process either, when the 
index might be in an unfinished state.


writing a custom Filter plugin?

2013-05-13 Thread Jonathan Rochkind
Does anyone know of any tutorials, basic examples, and/or documentation 
on writing your own Filter plugin for Solr? For Solr 4.x/4.3?


I would like a Solr 4.3 version of the normalization filters found here 
for Solr 1.4: https://github.com/billdueber/lib.umich.edu-solr-stuff


But those are old, for Solr 1.4.

Does anyone have any hints for writing a simple substitution Filter for 
Solr 4.x?  Or, does a simple sourcecode example exist anywhere?
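
(For anyone else looking: a bare-bones sketch of the shape such a plugin takes --
not the umich filters themselves, and with made-up package/class names. It's
written against the analysis API roughly as it stood in 4.3; later 4.x releases
pass the args Map through the factory constructor rather than init().

// SimpleSubstitutionFilter.java -- rewrites one literal term to another,
// a stand-in for a real normalization rule.
package org.example.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SimpleSubstitutionFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public SimpleSubstitutionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;                         // no more tokens from upstream
    }
    if ("colour".equals(termAtt.toString())) {
      termAtt.setEmpty().append("color");   // rewrite the current token in place
    }
    return true;
  }
}

// SimpleSubstitutionFilterFactory.java -- so schema.xml can reference it.
package org.example.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class SimpleSubstitutionFilterFactory extends TokenFilterFactory {
  @Override
  public TokenStream create(TokenStream input) {
    return new SimpleSubstitutionFilter(input);
  }
}

Compile against the matching lucene-core and lucene-analyzers-common jars, put
the resulting jar somewhere the core loads libs from, and reference it in an
analyzer chain as <filter class="org.example.analysis.SimpleSubstitutionFilterFactory"/>.)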


Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Jonathan Rochkind
When I do things like this and want to avoid empty tokens even though 
previous analysis might result in some--I just throw one of these at the 
end of my analysis chain:


<!-- get rid of empty string tokens. max is required, although
     we don't really care. -->
<filter class="solr.LengthFilterFactory" min="1" max=""/>

A charfilter to filter raw characters can certainly still result in an 
empty token, if an initial token was composed solely of chars you wanted 
to filter out!  In which case you probably want the token to be deleted 
entirely, not still there as an empty token. The above length filter is 
one way to do that, although unfortunately requires specifying a 'max' 
even though I didn't actually want to filter out on the high end, oh well.
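
(To make that concrete, a rough sketch of the kind of chain I mean -- the field
type name, the pattern, and the max value are all just illustrative:

<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip the unwanted characters before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[\p{Punct}]" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- drop any tokens that ended up empty after the char filter -->
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
</fieldType>
)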



On 9/24/2012 1:07 PM, Jack Krupansky wrote:

I tried it and PRFF is indeed generating an empty token. I don't know
how Lucene will index or query an empty term. I mean, what it should
do. In any case, it is best to avoid them.

You should be using a charFilter to simply filter raw characters
before tokenizing. So, try:

<charFilter class="solr.PatternReplaceCharFilterFactory"/>

It has the same pattern and replacement attributes.

-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Monday, September 24, 2012 12:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr - Remove specific punctuation marks

1. Which query parser are you using?
2. I see the following comment in the Java 6 doc for regex \p{Punct}:
POSIX character classes (US-ASCII only), so if any of the punctuation is
some higher Unicode character code, it won't be matched/removed.
3. It seems very odd that the parsed query has empty terms - normally the
query parsers will ignore terms that analyze to zero tokens. Maybe your {
is not an ASCII left brace code and is (apparently) unprintable in the
parsed query. Or, maybe there is some encoding problem in the analyzer.

-- Jack Krupansky

-Original Message- From: Daisy
Sent: Monday, September 24, 2012 9:26 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr - Remove specific punctuation marks

I tried &amp; and it solved the 500 error code. But still it could find
punctuation marks.
Although the parsed query didn't contain the punctuation mark,

<str name="rawquerystring">{</str>
<str name="querystring">{</str>
<str name="parsedquery">text:</str>
<str name="parsedquery_toString">text:</str>

but still the numfound gives 1

<result name="response" numFound="1" start="0">

and the highlight shows the result of punctuation mark
<em>{</em>
The steps I did:
1- editing the schema
2- restart the server
3- delete the file
4- index the file




--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-Remove-specific-punctuation-marks-tp4009795p4009835.html

Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to exactly match fields which are multi-valued?

2012-03-08 Thread Jonathan Rochkind
Well, if you really want EXACT exact, just use a KeywordTokenizer (ie, 
not tokenize at all). But then matches will really have to be EXACT, 
including punctuation, whitespace, diacritics, etc.  But a query will 
only match if it 'exactly' matches one value in your multi-valued field.


You could try a KeywordTokenizer with some normalization too.

Either way, though, if you're issuing a query to a field tokenized with 
KeywordTokenizer that can include whitespace in its values, you really 
need to issue it as a _phrase query_, to avoid being messed up by the 
lucene or dismax query parser's pre-tokenization.  Which is 
potentially fine, that's what you want to do anyway for 'exact match'.  
Except if you wanted to use dismax multiple qf's with just a BOOST on 
the 'exact match', but _not_ a phrase query for other fields... well, I 
can't figure out any way to do it with this technique.


It gets tricky, I haven't found a great solution.
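
(For the "KeywordTokenizer with some normalization" idea, a rough sketch --
names and the exact normalization filters are just illustrative:

<fieldType name="exactish" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- whole field value becomes a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

And then query it as a phrase, e.g. q=myfield:"large cat", so the query parser
doesn't split on the whitespace before your analyzer sees it.)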

On 3/8/2012 7:44 AM, Erick Erickson wrote:

You haven't really given us much to go on here. Matches
are just like a single valued field with the exception of
the increment gap. Say one entry were
large cat big dog
in a multi-valued field. Say the next document
indexed two values,
large cat
big dog

And, say the increment gap were 100. The token offsets
for doc 1 would be
0, 1, 2, 3
and for doc 2 would be
0, 1, 101, 102

The only effective difference is that phrase queries with slop
less than 100 would NEVER match across multi-values. I.e.
"cat big"~10 would match doc1 but not doc 2

Best
Erick

2012/3/7 SuoNayisuonayi2...@163.com:

Hi all, how to offer exact-match capabilities on the multi-valued fields?

Any helps are appreciated!

SuoNayi


Re: need to support bi-directional synonyms

2012-02-23 Thread Jonathan Rochkind

Honestly, I'd just map 'em both to the same thing in the index.

sprayer, washer => sprayer

or

sprayer, washer => sprayer_washer

At both index and query time. Now if the source document includes either 
'sprayer' or 'washer', it'll get indexed as 'sprayer_washer'.  And if 
the user enters either 'sprayer' or 'washer', it'll search the index for 
'sprayer_washer', and find source documents that included either 
'sprayer' or 'washer'.


Of course, if you really use sprayer_washer, then if the user actually 
enters sprayer_washer they'll also find sprayer, washer, and 
sprayer_washer.


So it's probably best to actually use either 'sprayer' or 'washer' as 
the destination, even though it seems odd:


sprayer, washer => washer

Will do what you want, pretty sure.
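
(That assumes the usual wiring, something like the following in both the index
and query analyzers, with the mapping above in synonyms.txt:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
)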

On 2/23/2012 1:03 AM, remi tassing wrote:

Same question here...

On Wednesday, February 22, 2012, geeky2gee...@hotmail.com  wrote:

hello all,

i need to support the following:

if the user enters sprayer in the desc field - then they get results for
BOTH sprayer and washer.

and in the other direction

if the user enters washer in the desc field - then they get results for
BOTH washer and sprayer.

would i set up my synonym file like this?

assuming expand = true..

sprayer =>  washer
washer =>  sprayer

thank you,
mark

--
View this message in context:

http://lucene.472066.n3.nabble.com/need-to-support-bi-directional-synonyms-tp3767990p3767990.html

Sent from the Solr - User mailing list archive at Nabble.com.



Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-22 Thread Jonathan Rochkind
So I don't really know what I'm talking about, and I'm not really sure 
if it's related or not, but your particular query:


The Beatles as musicians : Revolver through the Anthology

With the lone word that's a ':', it reminds me of a dismax stopwords-type 
problem I ran into. Now, I ran into it on 1.4.  I don't know why it 
would be different on 1.4 and 3.x. And I see you aren't even using a 
multi-field dismax in your sample query, so it couldn't possibly be what 
I ran into... I don't think. But I'll write this anyway in case it gives 
someone some ideas.


The problem I ran into is caused by different analysis in two fields 
both used in a dismax, one that ends up keeping : as a token, and one 
that doesn't.  Which ends up having the same effect as the famous 
'dismax stopwords problem'.


Maybe somehow your schema changed such to produce this problem in 3.x 
but not in 1.4? Although again I realize the fact that you are only 
using a single field in your demo dismax query kind of suggests it's not 
this problem. Wonder if you try the query without the :, if the 
problem goes away, that might be a hint. Or, maybe someone more skilled 
at understanding what's in those Solr debug statements than I am (it's 
kind of all greek to me) will be able to take this hint and rule out or 
confirm that it may have something to do with your problem.


Here I write up the issue I ran into (which may or may not have anything 
to do with what you ran into)


http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/


Also, you don't say what your 'mm' is in your dismax queries, that could 
be relevant if it's got anything to do with anything similar to the 
issue I'm talking about.


Hmm, I wonder if Solr 3.x changes the way dismax calculates number of 
tokens for 'mm' in such a way that the 'varying field analysis dismax 
gotcha' can manifest with only one field, if the way dismax counts 
tokens for 'mm' differs from number of tokens the single field's 
analysis produces?


Jonathan

On 2/22/2012 2:55 PM, Naomi Dushay wrote:

I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem.   I 
have a test checking for a search result in Solr, and the test passes in Solr 
1.4, but fails in Solr 3.5.   Dismax is the desired QueryParser -- I just 
included output from lucene QueryParser to prove the document exists and is 
found

I am completely stumped.


Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser:

URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query:  all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:the beatl as musician revolv through the 
antholog in 1064395), product of:
   1.0 = queryWeight(all_search:the beatl as musician revolv through the 
antholog), product of:
 48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 
revolv=872 through=81366 the=3531140 antholog=11611)
 0.02063975 = queryNorm
   6.0562754 = fieldWeight(all_search:the beatl as musician revolv through the 
antholog in 1064395), product of:
 1.0 = tf(phraseFreq=1.0)
 48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 
revolv=872 through=81366 the=3531140 antholog=11611)
 0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:
URL:  qf=all_search&pf=all_search&q=The Beatles as musicians : Revolver through the 
Anthology
final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 
(all_search:"the beatl as musician revolv through the antholog"~3)~0.01

(no matches)


***Solr 1.4***

lucene QueryParser:

URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query:  all_search:"the beatl as musician revolv through the antholog"

5.2676983 = fieldWeight(all_search:the beatl as musician revolv through the 
antholog in 3469163), product of:
   1.0 = tf(phraseFreq=1.0)
   48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 
revolv=820 through=88238 the=3542123 antholog=11205)
   0.109375 = fieldNorm(field=all_search, doc=3469163)

dismax QueryParser:
URL:  qf=all_search&pf=all_search&q=The Beatles as musicians : Revolver through the 
Anthology
final query:  +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 
(all_search:"the beatl as musician revolv through the antholog"~3)~0.01

score:

7.449651 = (MATCH) sum of:
   3.7248254 = weight(all_search:the beatl as musician revolv through the 
antholog~1 in 3469163), product of:
 0.7071068 = queryWeight(all_search:the beatl as musician revolv through the 
antholog~1), product of:
   48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 
musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
   0.014681898 = queryNorm
 5.2676983 = fieldWeight(all_search:the beatl as musician revolv through the 
antholog in 3469163), product of:
   1.0 = tf(phraseFreq=1.0)

Re: replication, disk space

2012-01-19 Thread Jonathan Rochkind

Thanks for the response. I am using Linux (RedHat).

It sounds like it may possibly be related to that bug.

But the thing is, the timestamped index directory is looking to me like 
it's the _current_ one, with the non-timestamped one being an old out of 
date one.  So that does not seem to be quite the same thing reported in 
that bug, although it may very well be related.


At this point, I'm just trying to figure out how to clean up.  How to 
verify which of those copies really is the current one, which is 
currently being used by Solr -- and if it's the timestamped one, how to 
restore things to the state where there's only one non-timestamped index 
dir, ideally without downtime to Solr.


Anyone have any advice or ideas on those questions?

On 1/18/2012 1:23 PM, Artem Lokotosh wrote:

Which OS do you using?
Maybe related to this Solr bug
https://issues.apache.org/jira/browse/SOLR-1781

On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:

So Solr 1.4. I have a solr master/slave, where it actually doesn't poll for
replication, it only replicates irregularly when I issue a replicate command
to it.

After the last replication, the slave, in solr_home, has a data/index
directory as well as a data/index.20120113121302 directory.

The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index. file is actually the one currently being
used live, not the straight 'index'? Why?

I can't afford the disk space to leave both of these around indefinitely.
  After replication completes and is committed, why would two index dirs be
left?  And how can I restore this to one index dir, without downtime? If
it's really using the index.X directory, then I could just delete the
index directory, but that's a bad idea, because next time the server
starts it's going to be looking for index, not index..  And if it's
using the timestamped index file now, I can't delete THAT one now either.

If I was willing to restart the tomcat container, then I could delete one,
rename the other, etc. But I don't want downtime.

I really don't understand what's going on or how it got in this state. Any
ideas?

Jonathan






Re: replication, disk space

2012-01-19 Thread Jonathan Rochkind
Hmm, I don't have a replication.properties file, I don't think. Oh 
wait, yes I do there it is!  I guess the replication process makes this 
file?


Okay

I don't see an index directory in the replication.properties file at all 
though. Below is my complete replication.properties.


So I'm still not sure how to properly recover from this situation 
without downtime. It _looks_ to me like the timestamped directory is 
actually the live/recent one.  Its files have a more recent timestamp, 
and it's the one that /admin/replication.jsp mentions.


replication.properties:

#Replication details
#Wed Jan 18 10:58:25 EST 2012
confFilesReplicated=[solrconfig.xml, schema.xml]
timesIndexReplicated=350
lastCycleBytesDownloaded=6524299012
replicationFailedAtList=1326902305288,1326406990614,1326394654410,1326218508294,1322150197956,1321987735253,1316104240679,1314371534794,1306764945741,1306678853902
replicationFailedAt=1326902305288
timesConfigReplicated=1
indexReplicatedAtList=1326902305288,1326825419865,1326744428192,1326645554344,1326569088373,1326475488777,1326406990614,1326394654410,1326303313747,1326218508294
confFilesReplicatedAt=1316547200637
previousCycleTimeInSeconds=295
timesFailed=54
indexReplicatedAt=1326902305288
~


On 1/18/2012 1:41 PM, Dyer, James wrote:

I've seen this happen when the configuration files change on the master and replication deems it necessary to 
do a core-reload on the slave. In this case, replication copies the entire index to the new directory then 
does a core re-load to make the new config files and new index directory go live.  Because it is keeping the 
old searcher running while the new searcher is being started, both index copies exist until the swap is 
complete.  I remember having the same concern about re-starts, but I believe I tested this and solr will look 
at the replication.properties file on startup and determine the correct index dir to use from 
that.  So (If my memory is correct) you can safely delete index so long as 
replication.properties points to the other directory.

I wasn't familiar with SOLR-1781.  Maybe replication is supposed to clean up the extra 
directories and doesn't sometimes?  In any case, I've found whenever it happens its ok to 
go out and delete the one(s) not being used, even if that means deleting 
index.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Artem Lokotosh [mailto:arco...@gmail.com]
Sent: Wednesday, January 18, 2012 12:24 PM
To: solr-user@lucene.apache.org
Subject: Re: replication, disk space

Which OS do you using?
Maybe related to this Solr bug
https://issues.apache.org/jira/browse/SOLR-1781

On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:

So Solr 1.4. I have a solr master/slave, where it actually doesn't poll for
replication, it only replicates irregularly when I issue a replicate command
to it.

After the last replication, the slave, in solr_home, has a data/index
directory as well as a data/index.20120113121302 directory.

The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index. file is actually the one currently being
used live, not the straight 'index'? Why?

I can't afford the disk space to leave both of these around indefinitely.
  After replication completes and is committed, why would two index dirs be
left?  And how can I restore this to one index dir, without downtime? If
it's really using the index.X directory, then I could just delete the
index directory, but that's a bad idea, because next time the server
starts it's going to be looking for index, not index..  And if it's
using the timestamped index file now, I can't delete THAT one now either.

If I was willing to restart the tomcat container, then I could delete one,
rename the other, etc. But I don't want downtime.

I really don't understand what's going on or how it got in this state. Any
ideas?

Jonathan






Re: replication, disk space

2012-01-19 Thread Jonathan Rochkind

On 1/18/2012 1:53 PM, Tomás Fernández Löbbe wrote:

As far as I know, the replication is supposed to delete the old directory
index. However, the initial question is why is this new index directory
being created. Are you adding/updating documents in the slave? what about
optimizing it? Are you rebuilding the index from scratch in the master?


Thanks for the response. Not adding/updating in slave. Not optimizing in 
slave. YES sometimes rebuilding index from scratch in master.


I am on Linux, RedHat 5.

This server has also occasionally been having out-of-disk problems, 
which caused some replications to fail; an aborted replication could 
also possibly account for the extra index directory, perhaps? (It now 
has enough disk space to avoid that problem).


At this point, my main concern is getting things back into an expected 
stable state, eliminating the extra index dir, ideally 
without downtime.


Re: replication, disk space

2012-01-19 Thread Jonathan Rochkind
Okay, I do have an index.properties file too, and THAT one does contain 
the name of an index directory.


But it's got the name of the timestamped index directory!  Not sure how 
that happened, could have been Solr trying to recover from running out 
of disk space in the middle of a replication? I certainly never did that 
intentionally.


But okay, if someone can confirm if this plan makes sense to restore 
things without downtime:


1. rm the 'index' directory, which seems to be an old copy of the index 
at this point

2. 'mv index.20120113121302 index'
3. Manually edit index.properties to have index=index, not 
index=index.20120113121302

4. Send reload core command.
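
(By "reload core command" I mean the CoreAdmin RELOAD call, something like the
following -- host and core name made up:

http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycore
)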

Does this make sense?  (I just experimentally tried a reload core 
command, and even though it's not supposed to, it DID result in about 20 
seconds of unresponsiveness from my solr server, not sure why, could 
just be lack of CPU or RAM on the server to do what's being asked of it. 
But if that's the best I can do, 20 seconds of unavailability, I'll take 
it.)


On 1/19/2012 12:37 PM, Jonathan Rochkind wrote:
Hmm, I don't have a replication.properties file, I don't think. Oh 
wait, yes I do there it is!  I guess the replication process makes 
this file?


Okay

I don't see an index directory in the replication.properties file at 
all though. Below is my complete replication.properties.


So I'm still not sure how to properly recover from this situation 
without downtime. It _looks_ to me like the timestamped directory is 
actually the live/recent one.  Its files have a more recent 
timestamp, and it's the one that /admin/replication.jsp mentions.


replication.properties:

#Replication details
#Wed Jan 18 10:58:25 EST 2012
confFilesReplicated=[solrconfig.xml, schema.xml]
timesIndexReplicated=350
lastCycleBytesDownloaded=6524299012
replicationFailedAtList=1326902305288,1326406990614,1326394654410,1326218508294,1322150197956,1321987735253,1316104240679,1314371534794,1306764945741,1306678853902 


replicationFailedAt=1326902305288
timesConfigReplicated=1
indexReplicatedAtList=1326902305288,1326825419865,1326744428192,1326645554344,1326569088373,1326475488777,1326406990614,1326394654410,1326303313747,1326218508294 


confFilesReplicatedAt=1316547200637
previousCycleTimeInSeconds=295
timesFailed=54
indexReplicatedAt=1326902305288
~


On 1/18/2012 1:41 PM, Dyer, James wrote:
I've seen this happen when the configuration files change on the 
master and replication deems it necessary to do a core-reload on the 
slave. In this case, replication copies the entire index to the new 
directory then does a core re-load to make the new config files and 
new index directory go live.  Because it is keeping the old searcher 
running while the new searcher is being started, both index copies 
exist until the swap is complete.  I remember having the same concern 
about re-starts, but I believe I tested this and solr will look at 
the replication.properties file on startup and determine the 
correct index dir to use from that.  So (If my memory is correct) you 
can safely delete index so long as replication.properties points 
to the other directory.


I wasn't familiar with SOLR-1781.  Maybe replication is supposed to 
clean up the extra directories and doesn't sometimes?  In any case, 
I've found whenever it happens its ok to go out and delete the one(s) 
not being used, even if that means deleting index.


James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Artem Lokotosh [mailto:arco...@gmail.com]
Sent: Wednesday, January 18, 2012 12:24 PM
To: solr-user@lucene.apache.org
Subject: Re: replication, disk space

Which OS do you using?
Maybe related to this Solr bug
https://issues.apache.org/jira/browse/SOLR-1781

On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkindrochk...@jhu.edu  
wrote:
So Solr 1.4. I have a solr master/slave, where it actually doesn't 
poll for
replication, it only replicates irregularly when I issue a replicate 
command

to it.

After the last replication, the slave, in solr_home, has a data/index
directory as well as a data/index.20120113121302 directory.

The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index. file is actually the one currently 
being

used live, not the straight 'index'? Why?

I can't afford the disk space to leave both of these around 
indefinitely.
  After replication completes and is committed, why would two index 
dirs be
left?  And how can I restore this to one index dir, without 
downtime? If
it's really using the index.X directory, then I could just 
delete the

index directory, but that's a bad idea, because next time the server
starts it's going to be looking for index, not index..  And 
if it's
using the timestamped index file now, I can't delete THAT one now 
either.


If I was willing

replication, disk space

2012-01-18 Thread Jonathan Rochkind
So Solr 1.4. I have a solr master/slave, where it actually doesn't poll 
for replication, it only replicates irregularly when I issue a replicate 
command to it.


After the last replication, the slave, in solr_home, has a data/index 
directory as well as a data/index.20120113121302 directory.


The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index.<timestamp> directory is actually the one currently 
being used live, not the straight 'index'? Why?


I can't afford the disk space to leave both of these around 
indefinitely.  After replication completes and is committed, why would 
two index dirs be left?  And how can I restore this to one index dir, 
without downtime? If it's really using the index.X directory, then 
I could just delete the index directory, but that's a bad idea, 
because next time the server starts it's going to be looking for 
"index", not "index.<timestamp>".  And if it's using the timestamped index file 
now, I can't delete THAT one now either.


If I was willing to restart the tomcat container, then I could delete 
one, rename the other, etc. But I don't want downtime.


I really don't understand what's going on or how it got in this state. 
Any ideas?


Jonathan



replication failure, logs or notice?

2012-01-12 Thread Jonathan Rochkind
I think maybe my Solr 1.4 replications have been failing for quite some 
time, without me realizing it, possibly due to lack of disk space to 
replicate some large segments.


Where would I look to see if a replication failed? Just the standard 
solr log?  What would I look for?


There's no facility to have, like an email sent if replication fails or 
anything, is there?


I realize that Solr/java logging is something that still confuses me, 
I've done whatever was easiest, but I'm vaguely remembering now that by 
picking the right logging framework and configuring it properly, maybe 
you can send different types of events to different logs, like maybe 
replication events to their own log? Is this a thing?


Thanks for any ideas,

Jonathan




Re: changing omitNorms on an already built index

2011-11-07 Thread Jonathan Rochkind

On 10/27/2011 9:14 PM, Erick Erickson wrote:

Well, this could be explained if your fields are very short. Norms
are encoded into (part of?) a byte, so your ranking may be unaffected.

Try adding debugQuery=on and looking at the explanation. If you've
really omitted norms, I think you should see clauses like:

1.0 = fieldNorm(field=features, doc=1)
in the output, never something like


Thanks, this was very helpful. Indeed with debugQuery on, I get "1.0 = 
fieldNorm" on my index with omitNorms for the relevant field, and in my 
index without omitNorms for the relevant field, I get a non-unit value 
for fieldNorm. Thanks for giving me a way to reassure myself that 
omitNorms really is doing its thing.


Now to dive into my debugQuery and figure out why it doesn't seem to be 
having as much effect as I anticipated on relevance!





changing omitNorms on an already built index

2011-10-27 Thread Jonathan Rochkind
So Solr 1.4.  I decided I wanted to change a field to have 
omitNorms=true that didn't previously.


So I changed the schema to have omitNorms=true.  And I reindexed all 
documents.


But it seems to have had absolutely no effect. All relevancy rankings 
seem to be the same.


Now, I could have a mistake somewhere else, maybe I didn't do what I 
thought.


But I'm wondering if there are any known issues related to this, is 
there something special you have to do to change a field from 
omitNorms=false to omitNorms=true on an already built index?  Other than 
re-indexing everything? Any known issues relevant here?


Thanks for any help,

Jonathan


Re: Questions about LocalParams syntax

2011-09-20 Thread Jonathan Rochkind
I don't have the complete answer. But I _think_ if you do one 'bq' param 
with multiple space-separated directives, it will work.


And escaping is a pain.  But can be made somewhat less of a pain if you 
realize that single quotes can sometimes be used instead of 
double-quotes. What I do:


_query_:"{!dismax qf='title something else'}"

So by switching between single and double quotes, you can avoid need to 
escape. Sometimes you still do need to escape when a single or double 
quote is actually in a value (say in a 'q'), and I do use backslash 
there. If you had more levels of nesting though... I have no idea what 
you'd do.


I'm not even sure why you have the internal quotes here:

bq=\"format:\\\"Book\\\"^50\"


Shouldn't that just be bq='format:Book^50', what's the extra double 
quotes around Book?  If you don't need them, then with switching 
between single and double, this can become somewhat less crazy and error 
prone:


_query_:"{!dismax bq='format:Book^50'}"

I think. Maybe. If you really do need the double quotes in there, then I 
think switching between single and double you can use a single backslash 
there.
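
(So, untested, but putting the two suggestions together -- one bq with
space-separated clauses, single quotes inside, double quotes only around the
whole local-params string -- I'd expect something roughly like:

((_query_:"{!dismax qf='title^500 author^300 allfields' bq='format:Book^50 format:Journal^150'}test"))
)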



On 9/20/2011 9:39 AM, Demian Katz wrote:

I'm using the LocalParams syntax combined with the _query_ pseudo-field to 
build an advanced search screen (built on Solr 1.4.1's Dismax handler), but I'm 
running into some syntax questions that don't seem to be addressed by the wiki 
page here:

http://wiki.apache.org/solr/LocalParams


1.) How should I deal with repeating parameters?  If I use multiple boost 
queries, it seems that only the last one listed is used...  for example:

((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:Book^50\" 
bq=\"format:Journal^150\"}test"))

 boosts Journals, but not Books.  If I reverse the order of the 
two bq parameters, then Books get boosted instead of Journals.  I can work 
around this by creating one bq with the clauses OR'ed together, but I would 
rather be able to apply multiple bq's like I can elsewhere.


2.) What is the proper way to escape quotes?  Since there are multiple 
nested layers of double quotes, things get ugly and it's easy to end up with 
syntax errors.  I found that this syntax doesn't cause an error:


((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:\\\"Book\\\"^50\" 
bq=\"format:\\\"Journal\\\"^150\"}test"))

 ...but it also doesn't work correctly - the boost queries are 
completely ignored in this example.  Perhaps this is more a problem related to  
_query_ than to LocalParams syntax...  but either way, a solution would be 
great!

thanks,
Demian



Re: XML injection interface in select servlet?

2011-09-20 Thread Jonathan Rochkind

On Sep 20, 2011, at 04:33 , Jan Peter Stotz wrote:



I am now asking myself why would someone implement such a bloodcurdling
vulnerability into a web service? Until now I haven't found an exploit
using the parameters in a way an attacker would get an advantage. But the
way those parameters are implemented raise some doubts on my side if
security has been seriously taken into account while implementing Solr...


Solr committers can correct me if I'm wrong, but my impression is that 
the Solr API itself is generally _not_ intended to be exposed to the 
world. It's expected to be protected behind a firewall, accessed by 
trusted applications.


People periodically post to this list planning on exposing it to the 
world anyway; but my impression is there may be all kinds of security 
problems there, as well as DoS possibilities, etc.


So I think it may be safe to say that security has not been seriously 
taken into account -- if you mean security on a Solr instance which has 
its entire API exposed publicly to the world.  I don't think that's 
the intended use case.


Re: JSON indexing failing...

2011-09-19 Thread Jonathan Rochkind
So I'm not an expert in the Solr JSON update message, never used it 
before myself. It's documented here:


http://wiki.apache.org/solr/UpdateJSON

But Solr is not a structured data store like mongodb or something; you 
can send it an update command in JSON as a convenience, but don't let 
that make you think it can store arbitrarily nested structured data like 
mongodb or couchdb or something.


Solr has a single flat list of indexes, as well as stored fields which 
are also a single flat list per-document. You can format your update 
message as JSON in Solr 3.x, but you still can't tell it to do something 
it's incapable of. If a field is multi-valued, according to the 
documentation, the json value can be an array of values. But if the JSON 
value is a hash... there's nothing Solr can do with this, it's not how 
solr works.



It looks from the documentation that the value can sometimes be a hash 
when you're communicating other meta-data to Solr, like field boosts:


"my_boosted_field": {    /* use a map with boost/value for a
                            boosted field */
  "boost": 2.3,
  "value": "test"
},

But you can't just give it arbitrary JSON, you have to give it JSON of 
the sort it expects. Which does not include arbitrarily nested data hashes.
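
(For comparison, a valid update message -- field names made up -- keeps
everything flat, with a JSON array only for a multiValued field, e.g. POSTed
to /update/json:

{
  "add": {
    "doc": {
      "id": "doc1",
      "title": "some title",
      "subject": ["history", "europe"]
    }
  }
}
)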


Jonathan



Re: query for point in time

2011-09-15 Thread Jonathan Rochkind
You didn't tell us what your schema looks like, what fields with what 
types are involved.


But similar to how you'd do it in your database, you need to find 
'documents' that have a start date before your date in question, and an 
end date after your date in question, to find the ones whose range 
includes your date in question.


Something like this:

q=start_date:[* TO 2010-01-05T23:59:59Z] AND end_date:[2010-01-05T00:00:00Z TO *]

Of course, you need to add on your restriction to just documents about 
'John Smith', through another AND clause or an 'fq'.


But in general, if you've got a db with this info already, and this is 
all you need, why not just use the db?  Multi-hierarchy data like this 
is going to give you trouble in Solr eventually, you've got to arrange 
the solr indexes/schema to answer your questions, and eventually you're 
going to have two questions which require mutually incompatible schema 
to answer.


An rdbms is a great general purpose question answering tool for 
structured data.  lucene/Solr is a great indexing tool for text matching.


On 9/15/2011 2:55 PM, gary tam wrote:

Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for project,
started date, end date.

looks something like


John Smith   web site bug fix   2010-01-01   2010-01-03
             unit testing       2010-01-04   2010-01-06
             QA support         2010-01-07   2010-01-12
             implementation     2010-01-13   2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks



Re: query for point in time

2011-09-15 Thread Jonathan Rochkind

I think there's something wrong with your database then, but okay.

You still haven't said what your Solr schema looks like -- that list of 
values doesn't say what the solr field names or types are. I think this 
is maybe because you don't actually have a Solr database and have no 
idea how Solr works, you're just asking in theory? On the other hand, 
you just said you have better performance with solr -- I'm not sure how 
you were able to tell the performance of solr in answering these queries 
if you don't even know how to make them!


But, again, assuming your data is set up like i'm guessing it is, it's 
quite similar to what you'd do with an rdbms.


What does 'most current' mean? Can jobs be overlapping? To find the 
project with the latest start date for a given person, just limit to 
documents with that current person in a 'q' or 'fq', and then sort by 
start_date desc. Perhaps limit to 1 if you really only want one hit.  
Same principle as you would in an rdbms.
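
(Roughly, with made-up field names and before URL-encoding, something like:

q=person_name:"John Smith"&sort=start_date desc&rows=1
)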


Again, this requires setting up your solr index in such a way to answer 
these sorts of questions. Each document in Solr will represent a 
person-project pair.  It'll have fields for person (or multiple fields, 
personID, personFirst, personLast, etc), project name, project start 
date, project end date.  This will make it easy/possible to answer 
questions like your examples with Solr, but will make it hard to answer 
many other sorts of questions -- unlike an rdbms, it is difficult to set 
up a Solr index that can flexibly answer just about any question you 
throw at it, particularly when you have hierarchical or otherwise 
multi-entity data.


If you are interested, the standard Solr tutorial is pretty good: 
http://lucene.apache.org/solr/tutorial.html





On 9/15/2011 6:39 PM, gary tam wrote:

Thanks for the reply.  We had the search within the database initially, but
it proved to be too slow.  With solr we have much better performance.

One more question, how could I find the most current job for each employee

My data looks like


John Smith   department A   web site bug fix   2010-01-01   2010-01-03
                            unit testing       2010-01-04   2010-01-06
                            QA support         2010-01-07   2010-01-12
                            implementation     2010-01-13   2010-01-22

Jane Doe     department A   QA support         2010-01-01   2010-05-01
                            implementation     2010-05-02   2010-09-28

Joe Doe      department A   PHP development    2011-01-01   2011-08-31
                            Java Development   2011-09-01   2011-09-15

I would like to return this as my search result

John Smith   department A   implementation     2010-01-13   2010-01-22
Jane Doe     department A   implementation     2010-05-02   2010-09-28
Joe Doe      department A   Java Development   2011-09-01   2011-09-15


Thanks in advance
Gary



On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:


You didn't tell us what your schema looks like, what fields with what types
are involved.

But similar to how you'd do it in your database, you need to find
'documents' that have a start date before your date in question, and an end
date after your date in question, to find the ones whose range includes your
date in question.

Something like this:

q=start_date:[* TO 2010-01-05T23:59:59Z] AND end_date:[2010-01-05T00:00:00Z TO *]

Of course, you need to add on your restriction to just documents about
'John Smith', through another AND clause or an 'fq'.

But in general, if you've got a db with this info already, and this is all
you need, why not just use the db?  Multi-hierarchy data like this is going
to give you trouble in Solr eventually, you've got to arrange the solr
indexes/schema to answer your questions, and eventually you're going to have
two questions which require mutually incompatible schema to answer.

An rdbms is a great general purpose question answering tool for structured
data.  lucene/Solr is a great indexing tool for text matching.


On 9/15/2011 2:55 PM, gary tam wrote:


Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for
project,
started date, end date.

looks something like


John Smith   web site bug fix   2010-01-01   2010-01-03
             unit testing       2010-01-04   2010-01-06
             QA support         2010-01-07   2010-01-12
             implementation     2010-01-13   2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks




RE: need some guidance about how to configure a specific solr solution.

2011-08-12 Thread Jonathan Rochkind
I don't know anything about LifeRay (never heard of it), but it sounds like 
you've actually figured out what you need to know about LifeRay; all you've got 
left is: how to replicate the writer solr server content into the readers.

This should tell you how: 
http://wiki.apache.org/solr/SolrReplication

You'll need to find and edit the configuration files for the Solrs involved -- 
if you don't normally do that because LifeRay hides 'em from you, you'll need to 
find 'em. But it's a straightforward Solr feature (since 1.4), replication. 
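
Roughly (hostnames made up, the details are on that wiki page): the write 
server gets a "master" section in its solrconfig.xml and each read server 
gets a "slave" section, something like:

<!-- on the write (master) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- on each read (slave) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://write-server:8080/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>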

From: Roman, Pablo [pablo.ro...@uhn.ca]
Sent: Thursday, August 11, 2011 12:10 PM
To: solr-user@lucene.apache.org
Subject: need some guidance about how to configure a specific solr solution.

Hi There,

I am in IT and work on a project based on Liferay 605 with solr-3.2 as the 
indexer/search engine.

I presently have only one server that is indexing and searching, but reading the 
Liferay Support suggestions they point to the need of having:
- 2 to n SOLR read-servers for searching from any member of the Liferay cluster
- 1 SOLR write-server where all Liferay cluster members write.

However, going down to the detail of implementing that on the Liferay side, I 
think I know how to do it, which is editing these entries in the Solr plugin:

solr-spring.xml in the WEB-INF/classes/META-INF folder. Open this file in a 
text editor and you will see that there are two entries which define where the 
Solr server can be found by Liferay:

<bean id="indexSearcher" class="com.liferay.portal.search.solr.SolrIndexSearcherImpl">
  <property name="serverURL" value="http://localhost:8080/solr/select" />
</bean>
<bean id="indexWriter" class="com.liferay.portal.search.solr.SolrIndexWriterImpl">
  <property name="serverURL" value="http://localhost:8080/solr/update" />
</bean>

However, I don't know how to replicate the writer solr server content into the 
readers. Please can you provide advice about that?

Thanks,
Pablo



RE: paging size in SOLR

2011-08-10 Thread Jonathan Rochkind
I would imagine the performance penalties with deep paging will ALSO be there 
if you just ask for 10,000 rows all at once though, instead of in, say, 100-row 
paged batches. Yes? No?

-Original Message-
From: simon [mailto:mtnes...@gmail.com] 
Sent: Wednesday, August 10, 2011 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: paging size in SOLR

Worth remembering there are some performance penalties with deep
paging, if you use the page-by-page approach. may not be too much of a
problem if you really are only looking to retrieve 10K docs.

-Simon

On Wed, Aug 10, 2011 at 10:32 AM, Erick Erickson
erickerick...@gmail.com wrote:
 Well, if you really want to you can specify start=0 and rows=10000 and
 get them all back at once.

 You can do page-by-page by incrementing the start parameter as you
 indicated.

 You can keep from re-executing the search by setting your queryResultCache
 appropriately, but this affects all searches so might be an issue.

 Best
 Erick

 On Wed, Aug 10, 2011 at 9:09 AM, jame vaalet jamevaa...@gmail.com wrote:
 hi,
 i want to retrieve all the data from solr (say 10,000 ids ) and my page size
 is 1000 .
 how do i get back the data (pages) one after other ?do i have to increment
 the start value each time by the page size from 0 and do the iteration ?
 In this case am i querying the index 10 time instead of one or after first
 query the result will be cached somewhere for the subsequent pages ?


 JAME VAALET




Re: Remote backup of Solr index over low-bandwith connection

2011-08-09 Thread Jonathan Rochkind
You can use rsync to automatically only transfer the files that have 
changed. I don't think you'll have to home grow your own 'only transfer 
the diffs' solution, I think rsync will do that for you.


But yes, running an optimization, after many updates/deletes, will 
generally mean nearly everything has changed.


Solr's index, of course _is_ lucene, so your experience with lucene will 
be applicable to Solr.  Unless lucene or Solr have added new features 
since you last used it, but you're still using lucene, when you're using 
Solr.


On 8/9/2011 11:22 AM, Peter Kritikos wrote:

Hello, everyone,

My company will be using Solr on the server appliance we deliver to 
our clients. We would like to maintain remote backups of clients' 
search indexes to avoid rebuilding a large index when an appliance fails.


One of our clients backs up their data onto a remote server provided 
by a vendor which only provides storage space, so I don't believe it 
is possible for us to set up a remote slave server to use Solr's 
replication functionality. Because our client has a low-bandwidth 
connection to their backup server, we would like to minimize the 
amount of data transferred to the remote machine. Our Solr index 
receives commits every few minutes and will probably be optimized 
roughly once a day. Does our frequently modified index allow us to 
transfer an amount of data proportional to the number of new documents 
added to the search index daily? From my understanding, optimizing an 
index makes very significant changes to its files. Is there a way 
around this that I may be missing?


We have faced this problem in the past when our product used a 
Lucene-based search engine. We were unable to find a solution where we 
could only copy the diffs introduced to the index since the most 
recent backup, so we opted to make our indexing process faster. In 
addition to plain text, many of the documents that we are indexing are 
binary, e.g. Word, PDF. We cached the extracted text from these binary 
documents on the clients' backup servers, saving us the cost of 
extraction at index time. If we must pursue a solution like this for 
Solr, how else might we optimize the indexing process?


Much appreciated,
Peter Kritikos




RE: Multiple Cores on different machines?

2011-08-09 Thread Jonathan Rochkind
 tables. Others are suggesting 2 separate indexes on 2 different machines and
 using SOLRs capacity to combine cores and generate a third index that
 denormalizes the tables for us.

What capability is that, exactly?  I think you may be imagining it. 

Solr does have some capability to distribute a single logical index across 
several different servers (sharding) -- this feature is mainly intended for 
scaling/performance, when your index gets too big for one server.  

I am not quite sure why it's so popular for people to come to the list trying 
to use sharding (or a mythical 'capacity to combine cores' which isn't quite 
the same thing) for entirely other problems, but it usually leads to pain. 

What problem is it you are trying to solve by splitting things into separate 
indexes on two different machines, and then later generating a third index 
aggregating the two indexes?  

I suppose you _could_ do that, first index into two separate indexes, and then 
have an indexer which reads from both of those two indexes, and adds to a third 
index.  But it wouldn't be using any 'capacity to combine cores' -- and  I 
don't believe there is any such 'capacity to combine cores' in such a way to 
somehow automatically build a third index from two source indexes with an 
entirely different schema that somehow manages to 'denormalize' the two source 
indexes. 

What are you trying to accomplish that makes you imagine this?

Re: Weighted facet strings

2011-08-08 Thread Jonathan Rochkind
One kind of hacky way to accomplish some of those tasks involves 
creating a lot more Solr fields. (This kind of 'de-normalization' is 
often the answer to how to make Solr do something).


So facet fields are ordinarily not tokenized or normalized at all. But 
that doesn't work very well for matching query terms.  So if you want 
actual queries to match on these categories, you probably want an 
additional field that is tokenized/analyzed.  If you want to boost 
different category assignments differently, you probably want _multiple_ 
additional tokenized/analyzed fields.


So for instance, create separate analyzed fields for each category 
'weight', perhaps using the default 'text' analysis type.


category_text_weight_1
category_text_weight_2
etc
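
(For illustration only -- a rough sketch of what such declarations might look 
like in schema.xml; the field names and the 'text' type here are placeholders, 
not something from this thread:

<field name="category_text_weight_1" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="category_text_weight_2" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="category_text_weight_3" type="text" indexed="true" stored="false" multiValued="true"/>

At index time your application would put each category label into the field 
matching its weight.)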

Then use dismax to query, include all those category_text_* fields in 
the 'qf', and boost the higher weight ones more than the lower weight ones.


That will handle a number of your use cases, but not all of them.

Your first two cases are the most problematic:

filter: category=some_category_name, query: *:* - Results should be 
scored by the above mentioned weight 


So Solr doesn't really work like that. Normally a filter does not affect 
the scoring of the actual results _at all_. But if you change the query to:


fq=category:some_category
q=some_category
defType=dismax
qf=category_text_weight_1 category_text_weight_2^10 
category_text_weight_3^20


THEN, with the multiple analyzed category_text_weight_* fields, as 
described above, I think it should do what you want. You may have to 
play with exactly what boost to give to each field.


But your second use case is still tricky.

Solr doesn't really do exactly what you ask, but by using this method I 
think you can figure out hacky ways to accomplish it.  I'm not sure if 
it will solve all of your use cases, but maybe this will give you a 
start to figuring it out.



On 8/5/2011 6:55 AM, Michael Lorz wrote:

Hi all,

I have documents which are (manually) tagged with categories. Each
category-document relation has a weight between 1 and 5:

5: document fits perfectly in this category,
.
.
1: document may be considered as belonging to this category.


I would now like to use this information with solr. At the moment, I don't use
the weight at all:

<field name="category" type="string" indexed="true" stored="true"
multiValued="true"/>

Both the category as well as the document body are specified as query fields
(<str name="qf"> in solrconfig.xml).


What I would like is the following:

- filter: category=some_category_name, query: *:*  - Results should be scored by
the above mentioned weight
- filter: category=some_category_name, query: some_keyword - Results should be
scored by a combination of the score of 'some_keyword' and the above mentioned
weight
- filter: none, query: some_category_name - Documents with category
'some_category_name' should be found as well as documents which contain the term
'some_category_name'. Results should be scored by a combination of the score of
'some_keyword' and the above mentioned weight


Do you have any ideas how this could be done?

Thanks in advance
Michi


Re: Dispatching a query to multiple different cores

2011-08-08 Thread Jonathan Rochkind
However, if you unify your schemas to do this, I'd consider whether you 
really want separate cores/shards in the first place.


If you want to search over all of them together, what are your reasons 
to put them in separate solr indexes in the first place?  Ordinarily, if 
you want to search over them all together, the best place to start is 
putting them in the same solr index.


Then, the distribution/sharding feature is generally your next step, 
only if you have so many documents that you need to shard for 
performance reasons. That is the intended use case of the 
distribution/sharding feature.


On 8/8/2011 4:54 PM, Erik Hatcher wrote:

You could use Solr's distributed (shards parameter) capability to do this.  
However, if you've got somewhat different schemas that isn't necessarily going 
to work properly.  Perhaps unify your schemas in order to facilitate this using 
Solr's distributed search feature?

Erik

On Aug 3, 2011, at 05:22 , Ahmed Boubaker wrote:


Hello there!

I have a multicore solr with 6 different simple cores and somewhat
different schemas, and I defined another meta core which I would like to be a
dispatcher:  the requests are sent to the simple cores and the results are
aggregated before being sent back to the user.

Any ideas or hints on how I can achieve this?
I am wondering whether writing a custom SearchComponent or a custom
SearchHandler is a good entry point?
Is it possible to access other SolrCores which are in the same container as
the meta core?

Many thanks for your help.

Boubaker




Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Jonathan Rochkind

Dismax queries can. But

sort=termfreq(all_lists_text,'indie+music')

is not using dismax.  Apparently the termfreq function cannot? I am not familiar 
with the termfreq function.

To understand why you'd need to reindex, you might want to read up on how 
lucene actually works, to get a basic understanding of how different indexing 
choices affect what is possible at query time. Lucene In Action is a pretty 
good book.



On 8/8/2011 5:02 PM, Jason Toy wrote:

Aren't Dismax queries able to search for phrases using the default
index (which is what I am using)? If I can already do phrase searches, I
don't understand why I would need to reindex to be able to access phrases
from a function.

On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsmamarkus.jel...@openindex.iowrote:


Alexei, thank you, that does seem to work.

My sort results seem to be totally wrong though, I'm not sure if it's
because of my sort function or something else.

My query consists of:
sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100
And I get back 4571232 hits.

That's normal, you issue a catch all query. Sorting should work but..


All the results don't have the phrase indie music anywhere in their

data.

  Does termfreq not support phrases?

No, it is TERM frequency and indie music is not one term. I don't know how
this function parses your input, but it might not understand your + escape
and think it's one term consisting of exactly that.


If not, how can I sort specifically by termfreq of a phrase?

You cannot. What you can do is index multiple terms as one term using the
shingle filter. Take care, it can significantly increase your index size
and
number of unique terms.




On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko

ale...@superdownloads.com.br  wrote:

You can use the standard query parser and pass q=*:*

2011/8/8 Jason Toyjason...@gmail.com


I am trying to list some data based on a function I run,
specifically  termfreq(post_text,'indie music'), and I am unable to
do it without passing in data to the q parameter.  Is it possible to get
a sorted list without searching for any terms?

--

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533





Re: Can Solr with the StatsComponent analyze 20+ million files?

2011-08-08 Thread Jonathan Rochkind

On 8/8/2011 5:10 PM, Markus Jelsma wrote:

Will the StatsComponent in Solr do what we need with minimal configuration?
Can the StatsComponent only be used on a subset of the data? For
example, only look at data from certain months?

If I remember correctly, it cannot.


Well, if you index things properly, you could apply an fq to restrict to 
only certain months, and then use StatsComponent on top.
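
(For example, something along these lines -- assuming hypothetical field names 
'month' and 'duration', which are not from this thread:

q=*:*&fq=month:2011-07&stats=true&stats.field=duration&rows=0

would compute stats only over the documents from that month.)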


But I'd agree with others that Solr is probably not the best tool for 
this job. Solr's primary area of competency is text indexing and text 
search, not mathematical calculation. If you need a whole lot of text 
indexing and a little bit of math too, you might be able to get 
StatsComponent to do what you need, although you'll probably run into 
some tricky parts because this isn't really Solr's focus.


But if you need a whole bunch of math and no text indexing at all -- use 
a tool that has math rather than text search as its prime area of 
competency/focus, don't make things hard for yourself by using the wrong 
tool for the job.


(StatsComponent, incidentally, performs not-so-great on very large 
result sets, depending on what you ask it for).


Re: Indexing tweet and searching @keyword OR #keyword

2011-08-04 Thread Jonathan Rochkind
It's the WordDelimiterFilterFactory in your filter chain that's removing the 
punctuation entirely from your index, I think.


Read up on what the WordDelimiter filter does, and what its settings 
are; decide how you want things to be tokenized in your index to get the 
behavior you want; either get WordDelimiter to do it that way by 
passing it different arguments, or stop using WordDelimiter; come back 
with any questions after trying that!
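
(One possible direction, sketched here as an assumption rather than a tested 
config: recent WordDelimiterFilterFactory versions accept a types attribute 
pointing to a file that reassigns character types, so you could mark # and @ 
as ALPHA instead of delimiters, e.g.

filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt" ... 

with a wdfftypes.txt containing something like:

\u0023 => ALPHA
\u0040 => ALPHA

where \u0023 is # and \u0040 is @. Check the result in the analysis admin page 
before relying on it.)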



On 8/4/2011 11:22 AM, Mohammad Shariq wrote:

I have indexed around 1 million tweets (using the text dataType).
When I search the tweets with # OR @ I don't get the exact result.
e.g.  when I search for #ipad OR @ipad I get results where ipad is
mentioned, skipping the # and @.
Please suggest how to tune this, or which filter factories to use, to get the
desired result.
I am indexing the tweets as text; below is the text fieldType from my
schema.xml.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>



Re: Is there anyway to sort differently for facet values?

2011-08-04 Thread Jonathan Rochkind

No, it cannot. It just sorts alphabetically, actually by raw byte-order.

No other facet sorting functionality is available, and it would be 
tricky to implement in a performant way because of the way lucene 
works.  But it would certainly be useful to me too if someone could 
figure out a way to do it.


On 8/4/2011 2:43 PM, Way Cool wrote:

Thanks Eric for your reply. I am aware of facet.sort, but I haven't used it.
I will try that though.

Can it handle the values below in the correct order?
Under 10
10 - 20
20 - 30
Above 30

Or
Small
Medium
Large
XL
...

My second question is: if Solr can't do that for the values above by
using facet.sort, are there any other ways in Solr?

Thanks in advance,

YH

On Wed, Aug 3, 2011 at 8:35 PM, Erick Ericksonerickerick...@gmail.comwrote:


have you looked at the facet.sort parameter? The index value is what I
think you want.

Best
Erick
On Aug 3, 2011 7:03 PM, Way Coolway1.wayc...@gmail.com  wrote:

Hi, guys,

Is there anyway to sort differently for facet values? For example,

sometimes

I want to sort facet values by their values instead of # of docs, and I

want

to be able to have a predefined order for certain facets as well. Is that
possible in Solr we can do that?

Thanks,

YH


Re: What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?

2011-08-04 Thread Jonathan Rochkind
I'm not sure what you mean by index distribution, that could possibly 
mean several things.


But Solr has had a replication feature built into it since 1.4, which can 
probably handle the same use cases as rsync, but better.  So that may be 
what you want.
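
(A minimal sketch of that config, with placeholder host/port and conf file 
names -- see the wiki for the real details.

On the master, in solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On the slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
)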


There are certainly other experiments going on involving various kinds 
of scaling/distribution, including the sharding feature, that I'm not 
very familiar with. I don't know if anyone's tried to do anything with 
hadoop.




On 8/4/2011 2:52 PM, Way Cool wrote:

Hi, guys,

What's the best way (practice) to do index distribution at this moment?
Hadoop? or rsyncd (back to 3 years ago ;-)) ?

Thanks,

Yugang



Re: lucene/solr, raw indexing/searching

2011-08-04 Thread Jonathan Rochkind

It depends. Okay, the source contains 4 harv. l. rev. 45 .

Do you want a user entered harv. to ALSO match harv (without the 
period) in source, and vice versa? Or do you require it NOT match? Or do 
you not care?


The default filter analysis chain will index 4 harv. l. rev. 45 
essentially as 4;harv;l;rev;45.  A phrase search for
4 harv. l. rev. 45 will match it, but so will a phrase search for 4 
harv l rev 45 , and in fact so will a phrase search for 4 harv. l. rev45


That could be good, or it could be bad.

The point of the Solr analysis chain is to apply tokenization and 
transformation at both index time and query time, so queries will match 
source in the way you want. You can customize this analysis chain 
however you want, in extreme cases even writing your own analyzers in 
Java. If the out of the box default isn't doing what you want, you'll 
have to spend some time thinking about how an inverted index like lucene 
works, and what you want. You would need to provide a lot more 
specifications/details for someone else to figure out what analysis 
chain will do what you want, but I bet you can figure it out yourself 
after reading up a bit and thinking a bit.


See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 On 8/4/2011 4:30 PM, dhastings wrote:

I have decided to use solr for indexing as well.

the types of searches I'm doing are professional/academic.
so for example, I need to match
all of the following exactly from my source data:
 3.56,
  4 harv. l. rev. 45,
  187-532,
 3 llm 56,
  5 unts 8,
 6 u.n.t.s. 78,
 father's obligation


i seem to keep running into issues getting this to work.  the searching is
being done on a text field that is not stored.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3226611.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Dismax mm per field

2011-08-03 Thread Jonathan Rochkind
There is not, and the way dismax works makes it not really that feasible 
in theory, sadly.


One thing you could do instead is combine multiple separate dismax 
queries using the nested query syntax. This will affect your relevancy 
ranking, possibly in odd ways, but anything that accomplishes 'mm per 
field' will necessarily not really be using dismax's disjunction-max 
relevancy ranking in the way it's intended.


Here's how you could combine two separate dismax queries:

defType=lucene
q=_query_:"{!dismax qf=field1 mm=100%}blah blah" AND _query_:"{!dismax 
qf=field2 mm=80%}foo bar"


That whole q value would need to be properly URI escaped, which I 
haven't done here for human-readability.


Dismax has always got an mm; there's no way to not have an mm with 
dismax, but mm=100% might be what you mean. Of course, one of those 
queries could also not be dismax at all, but the ordinary lucene query 
parser or anything else. And of course you could use the same query 
text for both nested queries, repeating e.g. blah blah in both.




On 8/3/2011 11:24 AM, Dmitriy Shvadskiy wrote:

Hello,
Is there a way to apply (e)dismax mm parameter per field? If I have a query
field1:(blah blah) AND field2:(foo bar)

is there a way to apply mm only to field2?

Thanks,
Dmitriy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-tp3222594p3222594.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Jonathan Rochkind
There's no great way to do this. I understand your problem as: It's a 
multi-valued field, but you want to sort on whichever of those values 
matched the query, not on the values that didn't. (Not entirely clear 
what to do if the documents are in the result set because of a match in 
an entirely different field!)


I would sometimes like to do that too, and haven't really been able to 
come up with any great way to do it.


Something involving facetting kind of gets you closer, but ends up being 
a huge pain and doesn't get  you (or at least me) all the way to 
supporting the interface I'd really want.


On 8/3/2011 10:39 AM, Olson, Ron wrote:

Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron

DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is  unauthorized and strictly prohibited.  If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.



Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Jonathan Rochkind
Not so much that it's a corner case in the sense of being unusual, 
necessarily (I'm not sure); it's just something that fundamentally 
doesn't fit well into lucene's architecture.


I'm not sure that filing a JIRA will be much use. It's really unclear 
how one would get lucene to do this, it would be significant work to do, 
and it's unlikely any Solr developer is going to decide to spend 
significant time on it unless they need it for their own clients.


On 8/3/2011 11:40 AM, Olson, Ron wrote:

*Sigh*...I had thought maybe reversing it would work, but that would require 
creating a whole new index, on a separate core, as the existing index is used 
for other purposes. Plus, given the volume of data, that would be a big deal, 
update-wise. What would be better would be to remove that particular sort 
option-button on the webpage. ;)

I'll create a Jira issue, but in the meanwhile I'll have to come up with something else. 
I guess I didn't realize how much of a corner case this problem is. :)

Thanks for the suggestions!

Ron

-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org]
Sent: Wednesday, August 03, 2011 10:26 AM
To: solr-user@lucene.apache.org
Subject: Re: Strategies for sorting by array, when you can't sort by array?

Hi Ron.
This is an interesting problem you have. One idea would be to create an index 
with the entity relationship going in the other direction.  So instead of one 
to many, go many to one.  You would end up with multiple documents with varying 
names but repeated parent entity information -- perhaps simply using just an ID 
which is used as a lookup. Do a search on this name field, sorting by a 
non-tokenized variant of the name field. Use Result-Grouping to consolidate 
multiple matches of a name to the same parent document. This whole idea might 
very well be academic since duplicating all the parent entity information for 
searching on that too might be a bit much than you care to bother with. And I 
don't think Solr 4's join feature addresses this use case. In the end, I think 
Solr could be modified to support this, with some work. It would make a good 
feature request in JIRA.

~ David Smiley

On Aug 3, 2011, at 10:39 AM, Olson, Ron wrote:


Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron

DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is  unauthorized and strictly prohibited.  If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.






Re: Setting up Namespaces to Avoid Running Multiple Solr Instances

2011-08-03 Thread Jonathan Rochkind
I think that Solr multi-core (nothing to do with CPU cores, just what 
it's called in Solr) is what you're looking for. 
http://wiki.apache.org/solr/CoreAdmin
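
(As a rough sketch -- the core names and paths here are made up: a single Solr 
instance can serve several cores defined in solr.xml, e.g.

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="site1" instanceDir="site1"/>
    <core name="site2" instanceDir="site2"/>
  </cores>
</solr>

Each core then has its own schema, config and index, and its own URL path such 
as http://localhost:8983/solr/site1/select, all on one port.)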


On 8/3/2011 2:25 PM, Mike Papper wrote:

Hi, we run several independent websites on the same machines. Each site uses
a similar codebase for search. Currently each site contacts its own solr
server on a slightly different port. This means of course that we are
running several solr servers (each on their own port) on the same machine. I
would like to make this simpler by running just one server, listening on one
port. Can we do this and at the same time have the indexes and search data
separated for each web site?

So, I'm asking if I can namespace or federate the solr server. But by doing
so I would like to have the indexes etc. not comingled within the server.

I'm new to solr so there might be a hiccup from the fact that currently each
solr server points to its own directory on a site-specific path (something
like /apps/site/solr/*) which contains the solr plugin (we're using Ruby on
Rails). Can this be set up as a namespace (one for each web site) within the
single server instance?

Mike



Re: lucene/solr, raw indexing/searching

2011-08-02 Thread Jonathan Rochkind
In your solr schema.xml, are the fields you are using defined as text 
fields with analyzers? It sounds like you want no analysis at all, which 
probably means you don't want text fields either, you just want string 
fields. That will make it impossible to search for individual tokens 
though, searches will match only on complete matches of the value.


I'm not quite sure how to do what you want, it depends on exactly what 
you want. What kind of searching do you expect to support?  If you still 
do want tokenization, you'll still want some analysis... but I'm not 
quite sure how that corresponds to what you'd want to do on the lucene 
end.  What you're trying to do is going to be inevitably confusing, I 
think. Which doesn't mean it's not possible.  You might find it less 
confusing if you were willing to use Solr to index though, rather than 
straight lucene -- you could use Solr via the SolrJ java classes, rather 
than the HTTP interface.


On 8/2/2011 11:14 AM, dhastings wrote:

Hello,
I am trying to get lucene and solr to agree on a completely Raw indexing
method.  I use lucene in my indexers that write to an index on disk, and
solr to search those indexes that i create, as creating the indexes without
solr is much much faster than using the solr server.

are there settings for BOTH solr and lucene to use EXACTLY what's in the
content as opposed to interpreting what it thinks I'm trying to do?  My
content is extremely specific and needs no interpretation or adjustment,
indexing or searching, a text field.

for example:

203.1 seems to be indexed as 2031.  searching for 203.1 i can get to work
correctly, but then it won't find what's indexed using 3.1's standard
analyzer.

if i have content that is :
this is rev. 23.302

i need it indexed EXACTLY as it appears,
this is rev. 23.302

I do not want any of solr or lucenes attempts to fix my content or my
queries.  rev. needs to stay rev. and not turn into rev, 23.302
needs to stay as such, and NOT turn into 23302.  this is for BOTH indexing
and searching.

any hints?

right now for indexing i have:

 Set nostopwords = new HashSet(); nostopwords.add("buahahahahahaha");

Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
writer  = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED);
writer.setUseCompoundFile(false) ;


and for searching i have in my schema :


  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>


Thanks.  Very much appreciated.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Jetty error message regarding EnvEntry in WebAppContext

2011-08-02 Thread Jonathan Rochkind

On 8/2/2011 11:42 AM, Marian Steinbach wrote:

Can anyone tell me how a working configuration for Jetty 6.1.22 would have
to look like?


You know that Solr distro comes with a jetty with a Solr in it, right, 
as an example application? Even if you don't want to use it for some 
reason, that would probably be the best model to look at for a working 
jetty with solr.


Or is the problem that you want a different version of jetty?

As it happens, I just recently set up a jetty 6.1.26 for another 
project, not for solr. It was kind of a pain not being too familiar with 
java deployment or jetty.  But I did get JNDI working, by following the 
jetty instructions here: http://docs.codehaus.org/display/JETTY/JNDI  
(It was a bit confusing to figure out what they were talking about not 
being familiar with jetty, but eventually I got it, and the instructions 
were correct.)


But if I wanted to run Solr in jetty, I'd start with the jetty that is 
distributed with solr, rather than trying to build my own.


Re: performance crossover between single index and sharding

2011-08-02 Thread Jonathan Rochkind
What's the reasoning  behind having three shards on one machine, instead 
of just combining those into one shard? Just curious.  I had been 
thinking the point of shards was to get them on different machines, and 
there'd be no reason to have multiple shards on one machine.


On 8/2/2011 1:59 PM, Burton-West, Tom wrote:

Hi Markus,

Just as a data point for a very large sharded index, we have the full text of 
9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 
machines. Each machine has 3 shards. The size of each shard ranges between 
475GB and 550GB.  We are definitely I/O bound. Our machines have 144GB of 
memory with about 16GB dedicated to the tomcat instance running the 3 Solr 
instances, which leaves about 120 GB (or 40GB per shard) for the OS disk cache. 
 We release a new index every morning and then warm the caches with several 
thousand queries.  I probably should add that our disk storage is a very high 
performance Isilon appliance that has over 500 drives and every block of every 
file is striped over no less than 14 different drives. (See blog for details *)

We have a very low number of queries per second (0.3-2 qps) and our modest 
response time goal is to keep 99th percentile response time for our application 
(i.e. Solr + application) under 10 seconds.

Our current performance statistics are:

average response time   300 ms
median response time    113 ms
90th percentile         663 ms
95th percentile         1,691 ms

We had plans to do some performance testing to determine the optimum shard size 
and optimum number of shards per machine, but that has remained on the back 
burner for a long time as other higher priority items keep pushing it down on 
the todo list.

We would be really interested to hear about the experiences of people who have 
so many shards that the overhead of distributing the queries, and 
consolidating/merging the responses becomes a serious issue.


Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

* 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, August 02, 2011 12:33 PM
To: solr-user@lucene.apache.org
Subject: Re: performance crossover between single index and sharding

Actually, i do worry about it. Would be marvelous if someone could provide
some metrics for an index of many terabytes.


[..] At some extreme point there will be diminishing
returns and a performance decrease, but I wouldn't worry about that at all
until you've got many terabytes -- I don't know how many but don't worry
about it.

~ David

-
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context:
http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in
dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing
list archive at Nabble.com.


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind
Any changes you make related to stemming or normalization are likely 
going to require a re-index, just how it goes, just how solr/lucene 
works.  What you can do just by normalizing at query time is limited, 
almost any good solution to this type of problem is going to require 
normalization at index time.


If you're going to be fiddling with a production solr, it pays to figure 
out a workflow such that you can introduce indexing changes without 
downtime, this is not the last time you'll have to do it.


On 8/1/2011 12:35 PM, thomas wrote:

Thanks Alexei,
Thanks Paul,

I played with the solr.PhoneticFilterFactory. Analysing my query in the solr
admin backend showed me how it works and that it is working. My major problem is
that this filter needs to be applied to the index chain as well as to the
query chain to generate matches for our search. We have a huge index at this
point and I'm not really happy about reindexing all the content.

Is there maybe a more subtle solution which works by manipulating
the query chain only?

Otherwise I need to back up the whole index and try to reindex overnight when
CMS users are sleeping.

I will have a look into the ColognePhonetic encoder. I'm just afraid I'll have
to reindex the whole content there as well.

Thomas

--
View this message in context: 
http://lucene.472066.n3.nabble.com/German-language-specific-problem-automatic-Spelling-correction-automatic-Synonyms-tp3216278p3216414.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind

On 8/1/2011 12:42 PM, Paul Libbrecht wrote:
Otherwise i need to backup the whole index and try to reindex 
overnight when

cms users are sleeping.

With some work you can do this using an extra solr that just pulls everything, 
then swaps the indexes (that needs a bit of downtime), then re-indexes the 
things changed during the night.
I feel this should be a standard feature of SOLR...



It sort of is, in the sense that you can do it with replication, with no 
downtime. (Although you'll need enough disk and RAM in the slave to warm 
the replicated index while still serving queries from the older index, 
for no downtime).


Reindex to a separate solr (or separate core), then have the actual 
production core set up as a slave, and have it replicate from master 
when the re-indexing is done.  You can have your relevant conf files 
(schema or solrconfig) set up to replicate too, so you get those new 
ones in production exactly when you get the new indexes they go with.


The replication feature isn't exactly set up for this, so it gets a bit 
confusing. I set up the 'slave' with NO polling.  It still needs to be 
set up with config saying it's a slave, though. And it still needs to 
have a 'master' URL in there, even though you can also supply/override 
the master URL with a manual replicate command; if there's no master URL 
at all, Solr will refuse to start up.   So I configure the master URL, but 
without any polling for changes. Then I manually issue an HTTP replicate 
command to the slave only when I have a rebuilt index on the master that I 
want to swap in. It seems to be working.
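
(The manual command is just an HTTP request to the slave's replication handler, 
something like the following -- host and port are placeholders:

http://slave-host:8983/solr/replication?command=fetchindex

You can also pass a masterUrl parameter on that request to override the 
configured master.)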


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind

On 8/1/2011 1:40 PM, Mike Sokolov wrote:
If you want to avoid re-indexing, you could consider building a 
synonym file that is generated using your rule set, and then using 
that to expand your queries.  You'd need to get a list of all terms in 
your index and then process them to generate synyonyms.  Actually, I 
don't know how to get a list of all the terms without Java 
programming, though: is there a way?


The terms component will give you a list of all terms, I think. 
http://wiki.apache.org/solr/TermsComponent


But this is getting awfully hacky and hard to maintain simply to avoid 
doing a re-index. I still think doing a re-index is a normal part of 
evolving your Solr configuration, and better to just get used to it (and 
figure out how to do it in production with no or minimal downtime) now.




Re: colocated term stats

2011-07-28 Thread Jonathan Rochkind

Not sure if this will do what you want, but one way might be using facets.

Take the term you are interested in, and apply it as an fq.  Now the 
result set will include only documents that include that term.  So also 
request facets for that result set; the top 10 facet values are the top 10 
terms that appear in that result set -- which are the top 10 terms that 
appear in documents together with your fq constraint. (Okay, you might 
need to look at 11, because one of the facet values will be the same 
term you fq constrained on.) You don't need to look at actual documents at 
all (rows=0), just the facet response.
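
(A sketch of the kind of request I mean, with a made-up field name 'text' and 
term 'solr':

q=*:*&fq=text:solr&facet=true&facet.field=text&facet.limit=11&facet.mincount=1&rows=0
)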


Make sense? Does that do what you want?

On 7/27/2011 9:12 PM, Twomey, David wrote:

Given a query term, is it possible to get from the index the top 10 collocated 
terms in the index.

ie:  return the top 10 terms that appear with this term based on doc count.

A plus would be to add some constraints on how near the terms are in the docs.






Re: Exact match not the first result returned

2011-07-28 Thread Jonathan Rochkind
Keep in mind that if you use a field type that includes spaces (eg 
StrField, or KeywordTokenizer), then if you're using dismax or lucene 
query parsers, the only way to find matches in this field on queries 
that include spaces will be to do explicit phrase searches with double 
quotes.


These fields will, however, work fine with pf in dismax/edismax as per 
Hoss's example.


But yeah, I do what Hoss recommends -- I've got a KeywordTokenizer copy 
of my searchable field. I use a pf on that field with a very high boost 
to try and boost truly complete matches, that match the entirety of 
the value.  It's not exactly 'exact', I still do some normalization, 
including flattening unicode to ascii, and normalizing one or more 
space or punctuation characters to exactly one space using a char regex filter.


It seems to pretty much work -- this is just one of various relevancy 
tweaks I've got going on, to the extent that my relevancy has become 
pretty complicated and hard to predict and doesn't always do what I'd 
expect/intend, but this particular aspect seems to mostly pretty much work.


On 7/27/2011 10:55 PM, Chris Hostetter wrote:

: With your solution, RECORD 1 does appear at the top but I think thats just
: blind luck more than anything else because RECORD 3 shows as having the same
: score. So what more can I do to push RECORD 1 up to the top. Ideally, I'd
: like all three records returned with RECORD 1 being the first listing.

with omitNorms RECORD1 and RECORD3 have the same score because only the
tf() matters, and both docs contain the term frank exactly twice.

the reason RECORD1 isn't scoring higher even though it contains (as you
put it matchings 'Fred' exactly is that from a term perspective, RECORD1
doesn't actually match myname:Fred exactly, because there are in fact
other terms in that field because it's multivalued.

one way to indicate that you *only* want documents where entire field
values match your input (ie: RECORD1 but no other records) would be to
use a StrField instead of a TextField, or an analyzer that doesn't split up
tokens (ie: something using KeywordTokenizer).  that way a query on
myname:Frank would not match a document where you had indexed the value
Frank Stalone, but a query for myname:Frank Stalone would.

in your case, you don't want *only* the exact field value matches, but you
want them boosted, so you could do something like copyField myname into
myname_str and then do...

   q=+myname:Frank myname_str:Frank^100

...in which case a match on myname is required, but a match on
myname_str will greatly increase the score.

dismax (and edismax) are really designed for situations like this...

   defType=dismax  qf=myname  pf=myname_str^100  q=Frank



-Hoss



Re: Possible to use quotes in dismax qf?

2011-07-28 Thread Jonathan Rochkind
It's not clear to me why you would try to do that, I'm not sure it makes 
a lot of sense.


You want to find all documents that have sail boat as a phrase AND 
have sail somewhere in them AND have boat somewhere in them?  That's 
exactly the same as just all documents that have sail boat as a phrase 
-- such documents will neccesarily include sail and boat, right?  So 
why not just ask for q=sail boat?


What are you actually trying to do?

Maybe dismax 'pf', which relevancy-boosts documents which have your 
input as a phrase, is what you really want?  Then you'd just search for 
q=sail boat, but documents that included sail boat as a phrase 
would be boosted, at the boost you specify.
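
(Something like this, where the pf fields and all the boost values are only 
placeholders:

defType=dismax
q=sail boat
qf=title^10 content^2
pf=title^20 content^4

The pf clause only affects scoring; it doesn't change which documents match.)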


On 7/28/2011 10:00 AM, O. Klein wrote:

I want to do a dismax search that searches for the original query and this query as a
phrase query:

q=sail boat needs to be converted to dismax query q=sail boat "sail boat"

qf=title^10 content^2

What is best way to do this?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Possible-to-use-quotes-in-dismax-qf-tp3206762p3206762.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Index

2011-07-28 Thread Jonathan Rochkind
I have no idea what you mean. A file on your disk? What does INDEX in 
solr mean?   Be more specific and clear, perhaps provide an example,  
and maybe someone can help you.


On 7/28/2011 5:45 PM, GAURAV PAREEK wrote:

Hi All,

How can we check that a particular file is not indexed in solr?

Regards,
Gaurav



Re: An idea for an intersection type of filter query

2011-07-27 Thread Jonathan Rochkind
I don't know the answer to feasibilty either, but I'll just point out 
that boolean OR corresponds to set union, not set intersection.  
So I think you probably mean a 'union' type of filter query; 
'intersection' does not seem to describe what you are describing; 
ordinary 'fq' values are 'intersected' already to restrict the result 
set, no?


So, anyhow, the basic goal, if I understand it right, is not to provide 
any additional semantics, but to allow individual clauses in an 'fq' 
OR to be cached and looked up in the filter cache individually.


Perhaps someone (not me) who understands the Solr architecture better 
might also have another suggestion for how to get to that goal, other 
than the specific thing you suggested. I do not know, sorry.


Hmm, but I start thinking, what about a general purpose mechanism to 
identify a sub-clause that should be fetched/retrieved from the filter 
cache. I don't _think_ current nested queries will do that:


fq=_query_:"foo:bar" OR _query_:"foo:baz"

That's legal now (and doesn't accomplish much) -- but what if the 
individual subquery components could consult the filter cache 
seperately?  I don't know if nested query is the right way to do that or 
not, but I'm thinking some mechanism where you could arbitrarily 
identify clauses that should be filter cached independently?


Jonathan

On 7/27/2011 4:00 PM, Shawn Heisey wrote:
I've been looking at the slow queries our Solr installation is 
receiving.  They are dominated by queries with a simple q parameter 
(often *:* for all docs) and a VERY complicated fq parameter.  The 
filter query is built by going through a set of rules for the user and 
putting together each rule's query clause separated by OR -- we can't 
easily break it into multiple filters.


In addition to causing queries themselves to run slowly, this causes 
large autowarm times for our filterCache -- my filterCache 
autowarmCount is tiny (4), but it sometimes takes 30 seconds to warm.


I've seen a number of requests here for the ability to have multiple 
fq parameters ORed together.  This is probably possible, but in the 
interests of compatibility between versions, very impractical.  What 
if a new parameter was introduced?  It could be named fqi, for filter 
query intersection.  To figure out the final bitset for multiple fq 
and fqi parameters, it would use this kind of logic:


fq AND fq AND fq AND (fqi OR fqi OR fqi)

This would let us break our filters into manageable pieces that can 
efficiently populate the filterCache, and they would autowarm quickly.


Is the filter design in Solr separated cleanly enough to make this at 
all reasonable?  I'm not a Java developer, so I'd have a tough time 
implementing it myself.  When I have a free moment I will take a look 
at the code anyway.  I'm trying to teach myself Java.


Thanks,
Shawn




Re: Speeding up search by combining common sub-filters

2011-07-27 Thread Jonathan Rochkind
I'm pretty sure Solr/lucene has no such optimization already, but 
it's not clear to me that it would result in much of a performance 
benefit. Just because of the way lucene works, it's not obvious to me 
that the second version of your query would be noticeably faster than the 
first version.


Maybe in cases with many many clauses, rather than the few clauses in 
your example. You'd definitely want to performance test it to verify 
there are any gains, before embarking on writing the 'optimization' -- 
you can test it just by sending the different versions of your real 
world queries to Solr and seeing what the response times are, 
calculating the hypothetically 'optimized' version yourself by hand if 
need be, right?




On 7/27/2011 5:05 PM, Scott Smith wrote:

We have a solr application which ends up creating queries with very complicated 
filters (literally hundreds and sometimes thousands of terms -- typically a large 
number of terms OR'ed together where each of these terms might have a half a 
dozen keywords ANDed/ORed together).  In looking at the filters, I realized 
that there are often a lot of common sub-filters.

A simple example of what I mean is:

 (cat AND dog) OR (cat AND horse)

This could clearly be simplified by saying:

 cat AND (dog OR horse)

It turns out that finding and combining common sub-filters isn't trivial for our 
application.  So, before I start a project to attempt some kind of 
optimization, my question is whether it's likely that I will see significant 
decreases in query times to justify the development effort it takes to optimize the 
filters.  Certainly, if I thought I might get a 20%+ decrease in time, I would say it's 
probably a good project.  If it's just a few percentage points of improvement, then I'm 
less excited about doing it.

Does Solr already go through some kind of optimization which effectively 
combines common sub-filters and possibly duplicated terms?  Does anyone have 
any thoughts on this subject?

Thanks

Scott



slave data files way bigger than master

2011-07-26 Thread Jonathan Rochkind

So I've got Solr 1.4.  I've got replication going on.

Once a day, before replication, I optimize on master.  Then I replicate.

I'd expect optimization before replication would basically replace all 
files on the slave; this is expected.


But that means I'd also expect that the index files on the slave would be 
identical to, and the same size as, those on the master after replication -- 
that is the point of replication, yes?


But they are not. The master is only 12G, the slave is 39G.  The index 
files in slave and master have completely different filenames too, I 
don't know if that's expected, but it's not what I expected.  I'll post 
complete file lists below.


Anyone have any idea what's going on?  Also... I wonder if these extra 
index files on the slave are just extra, not even looked at by the slave 
solr, or if instead they actually ARE included in the indexes!  If the 
latter, and we have 'ghost' documents in the index, that could explain 
some weird problems I'm having with the slave getting Java out of heap 
space errors due to huge uninverted indexes, even though the index is 
basically the same with the same solrconfig.xml settings as it has been 
for a while, without such problems.


Greatly appreciate if anyone has any ideas.


MASTER: ls -lh master_index

total 12G
-rw-rw-r-- 1 tomcat tomcat  3.0G Jul 26 06:37 _24p.fdt
-rw-rw-r-- 1 tomcat tomcat   15M Jul 26 06:37 _24p.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 06:33 _24p.fnm
-rw-rw-r-- 1 tomcat tomcat  1.2G Jul 26 06:44 _24p.frq
-rw-rw-r-- 1 tomcat tomcat   49M Jul 26 06:44 _24p.nrm
-rw-rw-r-- 1 tomcat tomcat  1.1G Jul 26 06:44 _24p.prx
-rw-rw-r-- 1 tomcat tomcat  7.8M Jul 26 06:44 _24p.tii
-rw-rw-r-- 1 tomcat tomcat  660M Jul 26 06:44 _24p.tis
-rw-rw-r-- 1 tomcat tomcat  2.1G Jul 26 08:54 _2k4.fdt
-rw-rw-r-- 1 tomcat tomcat  7.6M Jul 26 08:54 _2k4.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 08:51 _2k4.fnm
-rw-rw-r-- 1 tomcat tomcat  719M Jul 26 08:59 _2k4.frq
-rw-rw-r-- 1 tomcat tomcat   25M Jul 26 08:59 _2k4.nrm
-rw-rw-r-- 1 tomcat tomcat  797M Jul 26 08:59 _2k4.prx
-rw-rw-r-- 1 tomcat tomcat  5.0M Jul 26 08:59 _2k4.tii
-rw-rw-r-- 1 tomcat tomcat  436M Jul 26 08:59 _2k4.tis
-rw-rw-r-- 1 tomcat tomcat  211M Jul 26 09:25 _2n3.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 09:25 _2n3.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 09:25 _2n3.fnm
-rw-rw-r-- 1 tomcat tomcat   72M Jul 26 09:26 _2n3.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 09:26 _2n3.nrm
-rw-rw-r-- 1 tomcat tomcat   78M Jul 26 09:26 _2n3.prx
-rw-rw-r-- 1 tomcat tomcat  668K Jul 26 09:26 _2n3.tii
-rw-rw-r-- 1 tomcat tomcat   53M Jul 26 09:26 _2n3.tis
-rw-rw-r-- 1 tomcat tomcat  186M Jul 26 09:49 _2q6.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 09:49 _2q6.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 09:49 _2q6.fnm
-rw-rw-r-- 1 tomcat tomcat   60M Jul 26 09:50 _2q6.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 09:50 _2q6.nrm
-rw-rw-r-- 1 tomcat tomcat   64M Jul 26 09:50 _2q6.prx
-rw-rw-r-- 1 tomcat tomcat  562K Jul 26 09:50 _2q6.tii
-rw-rw-r-- 1 tomcat tomcat   45M Jul 26 09:50 _2q6.tis
-rw-rw-r-- 1 tomcat tomcat  246M Jul 26 10:16 _2t9.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 10:16 _2t9.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 10:16 _2t9.fnm
-rw-rw-r-- 1 tomcat tomcat   68M Jul 26 10:17 _2t9.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 10:17 _2t9.nrm
-rw-rw-r-- 1 tomcat tomcat   89M Jul 26 10:17 _2t9.prx
-rw-rw-r-- 1 tomcat tomcat  602K Jul 26 10:17 _2t9.tii
-rw-rw-r-- 1 tomcat tomcat   53M Jul 26 10:17 _2t9.tis
-rw-rw-r-- 1 tomcat tomcat  221M Jul 26 10:45 _2wc.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 10:45 _2wc.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 10:45 _2wc.fnm
-rw-rw-r-- 1 tomcat tomcat   69M Jul 26 10:46 _2wc.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 10:46 _2wc.nrm
-rw-rw-r-- 1 tomcat tomcat   82M Jul 26 10:46 _2wc.prx
-rw-rw-r-- 1 tomcat tomcat  613K Jul 26 10:46 _2wc.tii
-rw-rw-r-- 1 tomcat tomcat   53M Jul 26 10:46 _2wc.tis
-rw-rw-r-- 1 tomcat tomcat   75M Jul 26 11:14 _2y6.fdt
-rw-rw-r-- 1 tomcat tomcat  315K Jul 26 11:14 _2y6.fdx
-rw-rw-r-- 1 tomcat tomcat   11M Jul 26 11:15 _2ze.fdt
-rw-rw-r-- 1 tomcat tomcat   42K Jul 26 11:15 _2ze.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 11:14 _2ze.fnm
-rw-rw-r-- 1 tomcat tomcat  157K Jul 26 11:14 _2ze.frq
-rw-rw-r-- 1 tomcat tomcat  6.9K Jul 26 11:14 _2ze.nrm
-rw-rw-r-- 1 tomcat tomcat  201K Jul 26 11:14 _2ze.prx
-rw-rw-r-- 1 tomcat tomcat  3.8K Jul 26 11:14 _2ze.tii
-rw-rw-r-- 1 tomcat tomcat  293K Jul 26 11:14 _2ze.tis
-rw-rw-r-- 1 tomcat tomcat  224M Jul 26 11:14 _2zf.fdt
-rw-rw-r-- 1 tomcat tomcat  774K Jul 26 11:14 _2zf.fdx
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 11:14 _2zf.fnm
-rw-rw-r-- 1 tomcat tomcat   79M Jul 26 11:15 _2zf.frq
-rw-rw-r-- 1 tomcat tomcat  2.5M Jul 26 11:15 _2zf.nrm
-rw-rw-r-- 1 tomcat tomcat   88M Jul 26 11:15 _2zf.prx
-rw-rw-r-- 1 tomcat tomcat  869K Jul 26 11:15 _2zf.tii
-rw-rw-r-- 1 tomcat tomcat   76M Jul 26 11:15 _2zf.tis
-rw-rw-r-- 1 tomcat tomcat   836 Jul 26 11:14 _2zg.fnm
-rw-rw-r-- 1 tomcat tomcat  

Re: commit time and lock

2011-07-25 Thread Jonathan Rochkind

Thanks, this is helpful.

I do indeed periodically update or delete just about every doc in the 
index, so it makes sense that optimization might be necessary even in 
post 1.4, but I'm still on 1.4 -- add this to another thing to look into 
rather than assume after I upgrade.


Indeed I was aware that it would trigger a pretty complete index 
replication, but, since it seemed to greatly improve performance (in 
1.4), so it goes. But yes, I'm STILL only updating once a day, even with 
all that. (And in fact, I'm only replicating once a day too, ha).


On 7/25/2011 10:50 AM, Erick Erickson wrote:

Yeah, the 1.4 code base is older. That is, optimization will have more
effect on that vintage code than on 3.x and trunk code.

I should have been a bit more explicit in that other thread. In the case
where you add a bunch of documents, optimization doesn't buy you all
that much currently. If you delete a bunch of docs (or update a bunch of
existing docs), then optimization will reclaim resources. So you *could*
have a case where the size of your index shrank drastically after
optimization (say you updated the same 100K documents 10 times then
optimized).

But even that is it depends (tm). The new segment merging, as I remember,
will possibly reclaim deleted resources, but I'm parroting people who actually
know, so you might want to verify that if it

Optimization will almost certainly trigger a complete index replication to any
slaves configured, though.

So the usual advice is to optimize maybe once a day or week during off hours
as a starting point unless and until you can verify that your
particular situation
warrants optimizing more frequently.

Best
Erick

On Fri, Jul 22, 2011 at 11:53 AM, Jonathan Rochkindrochk...@jhu.edu  wrote:

How old is 'older'?  I'm pretty sure I'm still getting much faster performance 
on an optimized index in Solr 1.4.

This could be due to the nature of my index and queries (which include some 
medium sized stored fields, and extensive facetting -- facetting on up to a 
dozen fields in every request, where each field can include millions of unique 
values. Amazing I can do this with good performance at all!).

It's also possible i'm wrong about that faster performance, i haven't done 
robustly valid benchmarking on a clone of my production index yet. But it 
really looks like that way to me, from what investigation I have done.

If the answer is that optimization is believed no longer necessary on versions 
LATER than 1.4, that might be the simplest explanation.

From: Pierre GOSSE [pierre.go...@arisem.com]
Sent: Friday, July 22, 2011 10:23 AM
To: solr-user@lucene.apache.org
Subject: RE: commit time and lock

Hi Mark

I've read that in a thread titled "Weird optimize performance degradation", where Erick Erickson 
states that "Older versions of Lucene would search faster on an optimized index, but this is no longer 
necessary.", and more recently in a thread you initiated a month ago, "Question about 
optimization".

I'll also be very interested if anyone had a more precise idea/datas of 
benefits and tradeoff of optimize vs merge ...

Pierre


-Original Message-
From: Marc SCHNEIDER [mailto:marc.schneide...@gmail.com]
Sent: Friday, July 22, 2011 15:45
To: solr-user@lucene.apache.org
Subject: Re: commit time and lock

Hello,

Pierre, can you tell us where you read that?
I've read here that optimization is not always a requirement to have an
efficient index, due to some low level changes in lucene 3.xx

Marc.

On Fri, Jul 22, 2011 at 2:10 PM, Pierre GOSSEpierre.go...@arisem.comwrote:


Solr will respond to searches during optimization, but commits will have to
wait for the end of the optimization process.

During optimization a new index is generated on disk by merging every
single file of the current index into one big file, so your server will be
busy, especially regarding disk access. This may alter your response time
and have a very negative effect on the replication of the index if you have a
master/slave architecture.

I've read here that optimization is not always a requirement to have an
efficient index, due to some low level changes in lucene 3.xx, so maybe you
don't really need optimization. What version of solr are you using ? Maybe
someone can point toward a relevant link about optimization other than solr
wiki
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

Pierre


-Original Message-
From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
Sent: Friday, July 22, 2011 12:45
To: solr-user@lucene.apache.org
Subject: Re: commit time and lock

Thanks for the clarity.

One more thing I want to know about optimization.

Right now I am planning to optimize the server every 24 hours. Optimization
is also time-consuming (last time it took around 13 minutes), so I want to
know:

1. While optimization is in progress, will the solr server respond or not?
2. If the server will not respond, then how to do 

RE: Re: previous and next rows of current record

2011-07-22 Thread Jonathan Rochkind
  Yes, exactly the same problem I am facing. Is there any way to resolve this issue?

I am not sure what you mean, any way to resolve this issue. Did you read and 
understand what I wrote below? I have nothing more to add.  What is it you're 
looking for?

The way to provide that kind of next/previous is to write application code to 
do it. Although it's not easy to do cleanly in a web app because of the 
sessionless architecture of the web. What are you using for your client 
application?  But honestly I probably have nothing more to say on the topic.


From : Jonathan Rochkind
To : solr-user@lucene.apache.org;
Subject : Re: previous and next rows of current record


I think maybe I know what you mean.

You have a result set generated by a query. You have an item detail page
in your web app -- on that item detail page, you want to give
next/previous buttons for current search results.

If that's it, read on (although news isn't good), if that's not it,
ignore me.

There is no good way to do it. Although it's not really so much a solr
problem.  As far as Solr is concerned, if you know the query, and you
know the current row into the query i, then just ask Solr for
rows=1&start=$(i-1) to get previous, or $(i+1) to get next. (You can't send
$(i-1) or $(i+1) to Solr literally -- that's just shorthand; your app would have to
calculate them and send the literal values.)

The problem is architecting a web app so when you are on an item detail
page, the app knows what the current Solr query was, and what the i
index into it was.

The app I work on wants to provide this feature too, but I am so unhappy
with what it currently does (it is both ugly AND does not actually work
at all right on several very common cases), that I am definitely not
going to provide it as an example.  But if you are willing to have your
web app send the current search and the index in the URL to the item
detail page, that'd certainly make it easier.

It's not so much a Solr problem -- the answer in Solr is pretty clear.
Keep track of what index into your results you are on, and then just ask
for one previous or more.  But there's no great way to make a web app
 that actually does that without horrid urls.  There's nothing built into 
Solr to help you. Solr is pretty much sessionless/stateless, it's got no
idea what the 'current' search for your particular session is.



On 7/21/2011 2:38 PM, Bob Sandiford wrote:
 But - what is it that makes '9' the next id after '5'?  why not '6'?  Or 
 '91238412'? or '4'?

 i.e. you still haven't answered the question about what 'next' and 'previous' 
 really means...

 But - if you already know that '9' is the next page, why not just do another 
 query with id '9' to get the next record?

 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.com


 -Original Message-
 From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
 Sent: Thursday, July 21, 2011 2:33 PM
 To: solr-user@lucene.apache.org
 Subject: Re: previous and next rows of current record

 Hi

 in my case there is no id sequence. id is generated sequence wise for
 all category. but when we filter by category then same id become
 random. If i m on detail page which have id 5 and next id is 9 so on
 same page my requirement is to get next id is 9.

 On Thursday, July 21, 2011, Bob Sandiford
   wrote:
 Well, it sort of depends on what you mean by the 'previous' and the
 'next' record.
 Do you have some type of sequencing built into your concept of your
 solr / lucene indexes?  Do you have sequential id's?
 i.e. What's the use case, and what's the data available to support
 your use case?
 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.com

 -Original Message-
 From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
 Sent: Thursday, July 21, 2011 2:18 PM
 To: solr-user@lucene.apache.org
 Subject: Re: previous and next rows of current record

 Pls help..

 On Thursday, July 21, 2011, Jonty Rhods
 wrote:
 Hi,

 Is there any special query in solr to get the previous and next
 record of the current record. I am getting single record detail using id
 from solr server. I need to get  next and previous on detail page.
 regardsJonty








RE: commit time and lock

2011-07-22 Thread Jonathan Rochkind
How old is 'older'?  I'm pretty sure I'm still getting much faster performance 
on an optimized index in Solr 1.4. 

This could be due to the nature of my index and queries (which include some 
medium sized stored fields, and extensive facetting -- facetting on up to a 
dozen fields in every request, where each field can include millions of unique 
values. Amazing I can do this with good performance at all!). 

It's also possible i'm wrong about that faster performance, i haven't done 
robustly valid benchmarking on a clone of my production index yet. But it 
really looks like that way to me, from what investigation I have done. 

If the answer is that optimization is believed no longer necessary on versions 
LATER than 1.4, that might be the simplest explanation. 

From: Pierre GOSSE [pierre.go...@arisem.com]
Sent: Friday, July 22, 2011 10:23 AM
To: solr-user@lucene.apache.org
Subject: RE: commit time and lock

Hi Mark

I've read that in a thread titled "Weird optimize performance degradation", 
where Erick Erickson states that "Older versions of Lucene would search faster 
on an optimized index, but this is no longer necessary.", and more recently in 
a thread you initiated a month ago, "Question about optimization".

I'll also be very interested if anyone had a more precise idea/datas of 
benefits and tradeoff of optimize vs merge ...

Pierre
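
(If it's useful, here is a rough SolrJ sketch of the two operations being compared; the
URL and the document-building code are placeholders, and the comments only summarize
what's already been said in this thread, so treat it as a sketch, not a recommendation:)

import java.util.Collection;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitVsOptimize {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Collection<SolrInputDocument> docs = buildDocs(); // placeholder for your own documents

        server.add(docs);
        server.commit();    // relatively cheap: flushes pending docs and opens a new searcher

        // expensive: rewrites the whole index into a single segment, heavy on disk I/O,
        // and on a master/slave setup it forces slaves to copy a near-complete index
        server.optimize();
    }

    private static Collection<SolrInputDocument> buildDocs() {
        return java.util.Collections.emptyList(); // stand-in so the sketch compiles
    }
}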


-----Original Message-----
From: Marc SCHNEIDER [mailto:marc.schneide...@gmail.com]
Sent: Friday, July 22, 2011 15:45
To: solr-user@lucene.apache.org
Subject: Re: commit time and lock

Hello,

Pierre, can you tell us where you read that?
I've read here that optimization is not always a requirement to have an
efficient index, due to some low level changes in lucene 3.xx

Marc.

On Fri, Jul 22, 2011 at 2:10 PM, Pierre GOSSE pierre.go...@arisem.com wrote:

 Solr will respond to searches during optimization, but commits will have to
 wait for the end of the optimization process.

 During optimization a new index is generated on disk by merging every
 single file of the current index into one big file, so your server will be
 busy, especially regarding disk access. This may alter your response time
 and has a very negative effect on the replication of the index if you have a
 master/slave architecture.

 I've read here that optimization is not always a requirement to have an
 efficient index, due to some low level changes in lucene 3.xx, so maybe you
 don't really need optimization. What version of solr are you using ? Maybe
 someone can point toward a relevant link about optimization other than solr
 wiki
 http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

 Pierre


 -----Original Message-----
 From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
 Sent: Friday, July 22, 2011 12:45
 To: solr-user@lucene.apache.org
 Subject: Re: commit time and lock

 Thanks for clarity.

 One more thing I want to know about optimization.

 Right now I am planning to optimize the server in 24 hour. Optimization is
 also time taking ( last time took around 13 minutes), so I want to know
 that
 :

 1. when optimization is under process that time will solr server response
 or
 not?
 2. if server will not response then how to do optimization of server fast
 or
 other way to do optimization so our user will not have to wait to finished
 optimization process.

 regards
 Jonty



 On Fri, Jul 22, 2011 at 2:44 PM, Pierre GOSSE pierre.go...@arisem.com
 wrote:

  Solr still respond to search queries during commit, only new indexations
  requests will have to wait (until end of commit?). So I don't think your
  users will experience increased response time during commits (unless your
  server is much undersized).
 
  Pierre
 
  -----Original Message-----
  From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
  Sent: Thursday, July 21, 2011 20:27
  To: solr-user@lucene.apache.org
  Subject: Re: commit time and lock
 
   Actually i m worried about the response time. i m commiting around 500
  docs in every 5 minutes. as i know,correct me if i m wrong; at the
  time of commiting solr server stop responding. my concern is how to
  minimize the response time so user not need to wait. or any other
  logic will require for my case. please suggest.
 
  regards
  jonty
 
  On Tuesday, June 21, 2011, Erick Erickson erickerick...@gmail.com
 wrote:
   What is it you want help with? You haven't told us what the
   problem you're trying to solve is. Are you asking how to
   speed up indexing? What have you tried? Have you
   looked at: http://wiki.apache.org/solr/FAQ#Performance?
  
   Best
   Erick
  
   On Tue, Jun 21, 2011 at 2:16 AM, Jonty Rhods jonty.rh...@gmail.com
  wrote:
   I am using solrj to index the data. I have around 5 docs indexed.
 As
  at
   the time of commit due to lock server stop giving response so I was
   calculating commit time:
  
   double starttemp = System.currentTimeMillis();
   server.add(docs);
   server.commit();
   

Re: Java replication takes slaves down

2011-07-21 Thread Jonathan Rochkind
How often do you replicate? Could it be a too-frequent-commit problem? 
(a replication is a commit to the slave).
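
(For reference, with the Java replication the poll frequency is a slave-side setting in 
solrconfig.xml -- roughly the sketch below, where the master URL and the interval are 
just placeholders; with a ~4.5 GB optimized index you probably don't want to pull it 
very often:)

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <!-- how often the slave polls the master, hh:mm:ss -->
    <str name="pollInterval">00:20:00</str>
  </lst>
</requestHandler>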


On 7/21/2011 4:39 AM, Alexander Valet | edelight wrote:

Hi everybody,

we are using Solr 1.4.1 as our search backend and are replicating (Java based) 
from one master to four slaves.
When our index data grew in size (optimized around 4,5 GB) lately we started 
having huge trouble to spread a new index to
the slaves. They run on 100% CPU and are not able to serve request anymore. We 
have to kill the
Java process to start them again...

Does anybody have a similar experience? Any hints or ideas on how to set up 
proper replication?


Thanks,
Alex






Re: previous and next rows of current record

2011-07-21 Thread Jonathan Rochkind

I think maybe I know what you mean.

You have a result set generated by a query. You have an item detail page 
in your web app -- on that item detail page, you want to give 
next/previous buttons for current search results.


If that's it, read on (although news isn't good), if that's not it, 
ignore me.


There is no good way to do it. Although it's not really so much a solr 
problem.  As far as Solr is concerned, if you know the query, and you 
know the current row into the query i, then just ask Solr for 
rows=1&start=$(i-1) to get previous, or i+1 to get next. (You can't send 
$(i-1) or $(i+1) to Solr that's just short hand, your app would have to 
calculate em and send the literals).


The problem is architecting a web app so when you are on an item detail 
page, the app knows what the current Solr query was, and what the i 
index into it was.


The app I work on wants to provide this feature too, but I am so unhappy 
with what it currently does (it is both ugly AND does not actually work 
at all right on several very common cases), that I am definitely not 
going to provide it as an example.  But if you are willing to have your 
web app send the current search and the index in the URL to the item 
detail page, that'd certainly make it easier.


It's not so much a Solr problem -- the answer in Solr is pretty clear. 
Keep track of what index into your results you are on, and then just ask 
for one previous or more.  But there's no great way to make a web app 
that actually does that without horrid urls.  There's nothing built into 
Solr to help you. Solr is pretty much sessionless/stateless, it's got no 
idea what the 'current' search for your particular session is.
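
(For what it's worth, the Solr side really is just the start/rows arithmetic -- a rough 
SolrJ sketch, with made-up class and field names, assuming the app carries the original 
query string and the current position i to the detail page:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class NextPrevExample {
    // fetch the single record at absolute position 'pos' in the results of query 'q'
    static SolrDocument recordAt(SolrServer server, String q, int pos) throws Exception {
        SolrQuery query = new SolrQuery(q);
        query.setStart(pos);   // pass i-1 for "previous", i+1 for "next"
        query.setRows(1);
        return server.query(query).getResults().get(0);
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int i = 5;  // current position, carried to the detail page by the app
        SolrDocument previous = recordAt(server, "title:monkey", i - 1);
        SolrDocument next = recordAt(server, "title:monkey", i + 1);
        System.out.println(previous.getFieldValue("id") + " / " + next.getFieldValue("id"));
    }
}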




On 7/21/2011 2:38 PM, Bob Sandiford wrote:

But - what is it that makes '9' the next id after '5'?  why not '6'?  Or 
'91238412'? or '4'?

i.e. you still haven't answered the question about what 'next' and 'previous' 
really means...

But - if you already know that '9' is the next page, why not just do another 
query with id '9' to get the next record?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com



-Original Message-
From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
Sent: Thursday, July 21, 2011 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: previous and next rows of current record

Hi

in my case there is no id sequence. id is generated sequence wise for
all category. but when we filter by category then same id become
random. If i m on detail page which have id 5 and next id is 9 so on
same page my requirement is to get next id is 9.

On Thursday, July 21, 2011, Bob Sandiford
bob.sandif...@sirsidynix.com  wrote:

Well, it sort of depends on what you mean by the 'previous' and the

'next' record.

Do you have some type of sequencing built into your concept of your

solr / lucene indexes?  Do you have sequential id's?

i.e. What's the use case, and what's the data available to support

your use case?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com


-Original Message-
From: Jonty Rhods [mailto:jonty.rh...@gmail.com]
Sent: Thursday, July 21, 2011 2:18 PM
To: solr-user@lucene.apache.org
Subject: Re: previous and next rows of current record

Pls help..

On Thursday, July 21, 2011, Jonty Rhods jonty.rh...@gmail.com

wrote:

Hi,

Is there any special query in solr to get the previous and next

record of the current record. I am getting single record detail using id
from solr server. I need to get  next and previous on detail page.

regardsJonty









Re: Determine which field term was found?

2011-07-21 Thread Jonathan Rochkind

I've had this problem too, although never come up with a good solution.

I've wondered, is there any clever way to use the highlighter to 
accomplish tasks like this, or is that more trouble than any help it'll 
get you?


Jonathan

On 7/21/2011 5:27 PM, Yonik Seeley wrote:

On Thu, Jul 21, 2011 at 4:47 PM, Olson, Ron rol...@lbpc.com wrote:

Is there an easy way to find out which field matched a term in an OR query using Solr? I have a 
document with names in two multi-valued fields and I am searching for Smith, using the 
query A_NAMES:smith OR B_NAMES:smith. I figure I could loop through both result arrays, 
but that seems weird to me to have to search again for the value in a result.

That's pretty much the way lucene currently works - you don't know
what fields match a query.
If the query is simple, looping over the returned stored fields is
probably your best bet.

There are a couple other tricks you could use (although they are not
necessarily better):
1) with grouping by query (a trunk feature) you can essentially return
both queries with one request:
   q=*:*&group=true&group.query=A_NAMES:smith&group.query=B_NAMES:smith
   and optionally add a group.query=A_NAMES:smith OR B_NAMES:smith if
you need the combined list
2) use pseudo-fields (also trunk) in conjunction with the termfreq
function (the number of times a term appears in a field).  This
obviously only works with term queries.
   fl=*,count1:termfreq(A_NAMES,'smith'),count2:termfreq(B_NAMES,'smith')
   You can use parameter substitution to pull out the actual term and
simplify the query:
   fl=*,count1:termfreq(A_NAMES,$term),count2:termfreq(B_NAMES,$term)&term=smith


-Yonik
http://www.lucidimagination.com
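
(A rough SolrJ sketch of the "loop over the returned stored fields" option, assuming 
A_NAMES and B_NAMES are stored; the substring check is only an approximation of whatever 
analysis the fields actually use:)

import java.util.Collection;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class WhichFieldMatched {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("A_NAMES:smith OR B_NAMES:smith");
        for (SolrDocument doc : server.query(q).getResults()) {
            for (String field : new String[] { "A_NAMES", "B_NAMES" }) {
                Collection<Object> values = doc.getFieldValues(field); // null if field absent
                if (values == null) continue;
                for (Object v : values) {
                    if (v.toString().toLowerCase().contains("smith")) {
                        System.out.println(doc.getFieldValue("id") + " matched in " + field);
                    }
                }
            }
        }
    }
}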



Re: defType argument weirdness

2011-07-20 Thread Jonathan Rochkind
Huh, I'm still not completely following. I'm sure it makes sense if you 
understand the underlying implemetnation, but I don't understand how 
'type' and 'defType' don't mean exactly the same thing, just need to be 
expressed differently in different location.


Sorry for beating a dead horse, but maybe it would help if you could 
tell me what I'm getting wrong here:


defType can only go in top-level param, and determines the query parser 
for the overall q top level param.


type can only go  in a LocalParam, and determines the query parser that 
applies to whatever query (top-level or nested) that the LocalParam 
syntax lives in.  (Just as any other LocalParams apply only to the query 
that the LocalParam block lives in -- and nested queries inherit their 
query parser from the query they are nested in unless over-ridden, just 
as they inherit every other param from the query they are nested in 
unless over-ridden, nothing special here).


Therefore for instance:

defType=dismax&q=foo

is equivalent to

defType=lucene&q={!type=dismax}foo


Where am I straying in my mental model here? Because if all that is 
true, I don't understand how 'type' and 'defType' mean anything 
different -- they both choose the query parser, do they not? (which to 
me means I wish they were both called 'parser' instead of 'type' -- a 
'type' here is the name of a query parser, is it not?)  It's just that 
if it's in the top-level param you have to use 'defType', and if it's in 
a LocalParam you have to use 'type'.  That's been my mental model, which 
has served me well so far, but if it's wrong and it's going to trip me 
up on some as yet unencountered use cases, it would probably be good for 
me to know it!  (And probably good for some documentation to be written 
somewhere explaining it too). (And if they really are different, 
prefixing def to type is not making it very clear what the 
difference is! What's def supposed to stand for anyway?)


Jonathan


On 7/20/2011 3:49 PM, Chris Hostetter wrote:

: I do understand what they do (at least well enough to use them), but I
: find it confusing that it's called defType as a main param, but type
: in a LocalParam, when to me they both seem to do the same thing -- which

type as a localparam in a query string defines the type of query string
it is -- picking the parser.

defType determines the default value for type in the primary query
string.

: (and then there's 'qt', often confused with defType/type by newbies,
: since they guess it stands for 'query type', but which should probably
: actually have been called 'requestHandler'/'rh' instead, since that's
: what it actually chooses, no?  It gets very confusing).
:
: If it's generally recognized it's confusing and perhaps a somewhat
: inconsistent mental model being implied, I wonder if there'd be any
: interest in renaming these to be more clear, leaving the old ones as
: aliases/synonyms for backwards compatibility (perhaps with a long

qt is historic and already being de-emphasized in favor of using
path based names (ie: http://solr/handlername instead of
http://solr/select?qt=/handlername) so adding yet another alias for that
would be moving in the wrong direction.

type and defType probably make more sense when you think of
them in that order.  I don't see a strong need to confuse/complicate the
issue by adding more aliases for them.



-Hoss



RE: Updating fields in an existing document

2011-07-20 Thread Jonathan Rochkind
Nope, you're not missing anything, there's no way to alter a document in an 
index but reindexing the whole document. Solr's architecture would make it 
difficult (although never say impossible) to do otherwise. But you're right it 
would be convenient for people other than you. 

Reindexing a single document ought not to be slow, although if you have many of 
them at once it could be, or if you end up needing to very frequently commit to 
an index it can indeed cause problems. 

From: Benson Margulies [bimargul...@gmail.com]
Sent: Wednesday, July 20, 2011 6:05 PM
To: solr-user
Subject: Updating fields in an existing document

We find ourselves in the following quandry:

At initial index time, we store a value in a field, and we use it for
facetting. So it, seemingly, has to be there as a field.

However, from time to time, something happens that causes us to want
to change this value. As far as we know, this requires us to
completely re-index the document, which is slow.

It struck me that we can't be the only people to go down this road, so
I write to inquire if we are missing something.


RE: defType argument weirdness

2011-07-19 Thread Jonathan Rochkind
Is it generally recognized that this terminology is confusing, or is it just 
me?  

I do understand what they do (at least well enough to use them), but I find it 
confusing that it's called defType as a main param, but type in a 
LocalParam, when to me they both seem to do the same thing -- which I think 
should probably be called 'queryParser' rather than 'type' or 'defType'.  
That's what they do, choose the query parser for the query they apply to, 
right?  (And if they did/do different things, 'defType' vs 'type' doesn't 
really provide much hint as to what!)

These are both the same, right, but with different param names depending on 
position:
defType=lucene&q=foo
q={!type=lucene}foo  # uri escaping not shown

(and then there's 'qt', often confused with defType/type by newbies, since they 
guess it stands for 'query type', but which should probably actually have been 
called 'requestHandler'/'rh' instead, since that's what it actually chooses, 
no?  It gets very confusing). 

If it's generally recognized it's confusing and perhaps a somewhat inconsistent 
mental model being implied, I wonder if there'd be any interest in renaming 
these to be more clear, leaving the old ones as aliases/synonyms for backwards 
compatibility (perhaps with a long deprecation period, or perhaps existing 
forever). I know it was very confusing to me to keep track of these parameters 
and what they did for quite a while, and still trips me up from time to time. 

Jonathan

From: ysee...@gmail.com [ysee...@gmail.com] on behalf of Yonik Seeley 
[yo...@lucidimagination.com]
Sent: Tuesday, July 19, 2011 9:40 PM
To: solr-user@lucene.apache.org
Subject: Re: defType argument weirdness

On Tue, Jul 19, 2011 at 1:25 PM, Naomi Dushay ndus...@stanford.edu wrote:
 Regardless, I thought that defType=dismax&q=*:*   is supposed to be
 equivalent to  q={!defType=dismax}*:*  and also equivalent to q={!dismax}*:*

Not quite - there is a very subtle distinction.

{!dismax}  is short for {!type=dismax}, the type of the actual query,
and this may not be overridden.

The defType local param is only the default type for sub-queries (as
opposed to the current query).
It's useful in conjunction with the query  or nested query qparser:
http://lucene.apache.org/solr/api/org/apache/solr/search/NestedQParserPlugin.html

-Yonik
http://www.lucidimagination.com
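
(To make that distinction concrete -- roughly, and with made-up field and parameter names;
the last line uses the nested-query qparser linked above, so treat it as a sketch:

defType=dismax&q=solr rocks                      <- top-level param: parser for the main q
q={!dismax qf=title}solr rocks                   <- local "type"/shorthand: parser for this query string itself
q={!query defType=dismax v=$qq}&qq=solr rocks    <- local "defType": default parser for the nested $qq sub-query
)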


Re: NRT and commit behavior

2011-07-18 Thread Jonathan Rochkind
In practice, in my experience at least, a very 'expensive' commit can 
still slow down searches significantly, I think just due to CPU (or 
i/o?) starvation. Not sure anything can be done about that.  That's my 
experience in Solr 1.4.1, but since searches have always been async with 
commits, it probably is the same situation even in more recent versions, 
I'd guess.


On 7/18/2011 11:07 AM, Yonik Seeley wrote:

On Mon, Jul 18, 2011 at 10:53 AM, Nicholas Chasench...@earthlink.net  wrote:

Very glad to hear that NRT is finally here!  But my question is this: will
things still come to a standstill during a commit?

New updates can now proceed in parallel with a commit, and
searches have always been completely asynchronous w.r.t. commits.

-Yonik
http://www.lucidimagination.com



RE: Uninstall Solr

2011-07-01 Thread Jonathan Rochkind
There's no general documentation on that, because it depends on exactly what 
container you are using (Tomcat? Jetty? Something else?) and how you are using 
it.  It is confusing, but blame Java for that, nothing unique to Solr. 

So since there's really nothing unique to Solr here, you could try looking up 
documentation on the particular container you are using and how you undeploy 
.war's from it, or asking on lists related to that documentation. 

But it's also possible someone here would be able to help you out, but you'd 
have to provide more information about what container you are using, and 
ideally what you did in the first place to install it. 

Jonathan

From: gauravpareek2...@gmail.com [gauravpareek2...@gmail.com]
Sent: Friday, July 01, 2011 4:41 AM
To: erik.hatc...@gmail.com; solr-user@lucene.apache.org
Subject: Re: Uninstall Solr

Hello Erik,
thank u for ur help.

I understand that we need to delete the folder, but how do I undeploy the solr.war 
and where can I find it?
If anyone can send me the document to uninstall the solr software, that will be great.

Regards,
Gaurav Pareek
--
Sent via Nokia Email

--Original message--
From: Erik Hatcher erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Date: Thursday, June 30, 2011 8:10:48 PM GMT-0400
Subject: Re: Uninstall Solr

How'd you install it?

Generally you just delete the directory where you installed it.  But you 
might be deploying solr.war in a container somewhere besides Solr's example 
Jetty setup, in which case you need to undeploy it from those other containers 
and remove the remnants.

Curious though... why uninstall it?  Solr makes a mighty fine hammer to have 
around :)

Erik

On Jun 30, 2011, at 19:49 , GAURAV PAREEK wrote:

 Hi All,

 How to *uninstall* Solr completely ?

 Any help will be appreciated.

 Regards,
 Gaurav




Re: Index Version and Epoch Time?

2011-06-28 Thread Jonathan Rochkind

On 6/28/2011 1:38 PM, Pranav Prakash wrote:

- Will the commit by incremental indexer script also commit the
previously uncommitted changes made by full indexer script before it broke?


Yes, as long as the Solr instance hasn't crashed.  Anything added but 
not yet committed sticks around and will be committed on next 'commit'. 
There are no 'transactions' for adding docs in Solr, even if multiple 
processes are adding; if any one of them issues a 'commit' they'll all be 
committed.



Sometimes, while during execution, Solr's avg response time (avg resp time
for last 10 requests, read from log file) goes as high as 9000ms (which I am
still unclear why, any ideas how to start hunting for the problem?),


It could be a Java garbage collection issue. I have found it useful to 
start the JVM with Solr in it using some parameters to tune garbage 
collection. I use these JVM options:
 -server -XX:+AggressiveOpts -d64 -XX:+UseConcMarkSweepGC 
-XX:+UseCompressedOops


You've still got to make sure Solr has enough memory for what you're 
doing with it, with with your 5 million doc index might be more than you 
expect. On the other hand, giving a JVM too _much_ heap can cause 
slowdowns too, although I think the -XX:+UseConcMarkSweepGC should 
ameliorate that to some extent.


Possibly more likely, it could instead be Solr readying the new indexes. 
Do you issue commits in the middle of 'execution', and could the 
slowdown happen right after a commit?  When a commit is issued to Solr, 
Solr's got to switch new indexes in with the newly added documents, and 
'warm' those indexes in various ways. Which can be a CPU (as well as 
RAM) intensive thing. (For these purposes a replication from master 
counts as a commit (because it is), and an optimize can count too 
(because it's close enough)).


This can be especially a problem if you issue multiple commits very 
close together -- Solr's still working away at readying the index from 
the first commit, when the second comes in, and now Solr's trying to get 
ready two indexes at once (one of which will never be used because it's 
already outdated).  Or even more than two if you issue a bunch of 
commits in rapid succession.






  I found that the uncommitted changes were
applied and searchable. However, the updates were uncommitted.


There is in general no way that uncommitted adds could be searchable, 
that's probably not happening.   What is probably happening instead is 
that a commit _is_ happening.  One way a commit can happen even if you 
aren't manually issuing one is with various auto-commit settings in 
solrconfig.xml.  Commit any pending adds after X documents, or after T 
seconds, can both be configured. If they are configured, that could be 
causing commits to happen when you don't realize it, which could also 
trigger the slowdown due to a commit mentioned in the previous paragraph.
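
(For reference, those auto-commit settings live in solrconfig.xml and look roughly like 
this; the numbers are only illustrative:)

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>   <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>   <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>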


Jonathan



Re: moving to multicore without changing existing index

2011-06-28 Thread Jonathan Rochkind
Nope. But you can move your existing index into a core in a multi-core 
setup.  But a multi-core setup is a multi-core setup, there's no way to 
have an index accessible at a non-core URL in a multi-core setup.
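
(If it helps, a minimal solr.xml for that kind of multi-core setup looks roughly like this 
-- core names and instanceDirs are placeholders, and your existing index's conf/ and data/ 
would move under one core's instanceDir:)

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="existing" instanceDir="existing"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>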


On 6/28/2011 2:53 PM, lee carroll wrote:

hi
I'm looking at setting up multi core indices but also have an existing
index. Can I run
this index along side new index set up as cores. On a dev  machine
I've experimented with
simply adding solr.xml in solr home and listing the new cores in the
cores element but this breaks the existing
index.

container is tomcat and attempted set up was:

solrHome
 conf (existing running index)
 core1 (new core directory)
 solr.xml (cores element has one entry for core1)

Is this a valid approach ?

thanks lee



Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-22 Thread Jonathan Rochkind

Yeah, I see your points. It's complicated. I'm not sure either.

But the thing is:

 in order to use a feature like that you'd have to really think hard 
about

 the query analysis of your fields, and which ones will produce which
 tokens in which situations

You need to think really hard about the (index and query) analysis of 
your fields and which ones will produce which tokens _now_, if you are 
using multiple fields in a 'qf' with differing analysis, and using a 
percent mm. (Or similarly an mm that varies depending on how many terms).


That's what I've come to realize, that's the status quo. If your qf 
fields don't all have identical analysis, right _now_ you need to think 
really hard about the analysis and how it's going to possibly affect 
'mm', including for edge case queries.  If you don't, you likely have 
edge case queries (at least) which aren't behaving how you expected 
(whether you notice or have it brought to your attention by users or not).


Or you can just make sure all fields in your qf have identical analysis, 
and then you don't have to worry about it. But that's not always 
practical, a lot of the power of dismax qf ends up being combining 
fields with different analysis.


So I was trying to think of a way to make this less so, but still be 
able to take advantage of dismax, but I think you're right that maybe 
there isn't any, or at least nothing we've come up with yet.


Maybe what I really need is a query parser that does not do disjunction 
maximum at all, but somehow still combines different 'qf' type fields 
with different boosts on each field. I personally don't _neccesarily_ 
need the actual disjunction max calculation, but I do need combining 
of multiple fields with different boosts. Of course, I'm not sure exactly 
how it would combine multiple fields if not disjunction maximum, but 
perhaps one is conceivable that wouldn't be subject to this particular 
gotcha with differing analysis.


I also remain kind of confused about how the existing dismax figures out 
how many terms for the 'mm' type calculations. If someone wanted to 
explain that,  I would find it enlightening and helpful for 
understanding what's going on.


Jonathan

On 6/21/2011 10:20 PM, Chris Hostetter wrote:

: not other) setups/intentions.  It's counter-intuitive to me that adding
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set

agreed .. but that's where looking the debug info comes in to understand
the reason for that behavior is that your old qf treated part of your
input as garbage and that new field respects it and uses it in the
calculation.

mind you: the fewer hits behavior only happens when using a percentage
value in mm ... if you had mm=2 you'd get more results, but you've asked
for 66% (or whatever) and with that new qf there is a differnet number
of clauses produced by query parsing.

: I wonder if it would be a good idea to have a parameter to (e)dismax
: that told it which of these two behaviors to use? The one where the
: 'term count' is based on the maximum number of terms from any field in
: the 'qf', and one where it's based on the minimum number of terms
: produced from any field in the qf?  I am still not sure how feasible

even in your use case, i don't think you are fully considering what that
would produce.  imagine that an mmType=min param existed and gave you what
you're asking for.  Now imagine that you have two fields, one named
simple that strips all punctuation and one named complex that doesn't,
and you have a query like this...

q=Foo & Bar
qf=simple complex
mm=100%
mmType=min

   * Foo produces tokens for all qf
   * & only produces tokens for some qf (complex)
   * Bar produces tokens for all qf

your mmType would say there are only 2 tokens that we can query across
all fields, so our computed minShouldMatch should be 100% of 2 == 2

sounds good so far right?

the problem is you still have a query clause coming from that &
character ... you have 3 real clauses, one of which is that term query for
complex:& which means that with your (computed) minShouldMatch of 2 you
would see matches for any doc that happened to have indexed the & symbol
in the complex field and also matched *either* of Foo or Bar (in either
field)

So while a lot of your results would match both Foo and Bar, you'd get
still get a bunch of weird results.

: Or maybe a feature where you tell dismax, the number of tokens produced
: by field X, THAT's the one you should use for your 'term count' for mm,

Hmmm maybe.  i'd have to see a patch in action and play with it, to
really think it through ... hmmm ... honestly i really can't imagine how
that would be helpful in general...

in order to use a feature like that you'd have to really think hard about
the query analysis of your fields, and which ones will produce which
tokens in which situations in order to make sure you pick the *right*
value for that param -- but once you've done that hard 

Re: MultiValued facet behavior question

2011-06-22 Thread Jonathan Rochkind
Okay, so since you put cardiologist in the 'q', you only want facet 
values that have 'cardiologist' (or 'Cardiologist') to show up in the 
facet list.


In general, there's no good way to do that.

But.

If you want to do some client-side processing before you submit the 
query to Solr, and on the client side you can figure out exactly what 
you want: then you could try to play around with facet.filter or 
facet.query, to see if you can make it do what you want. It may or may 
not work out, depending on exactly your use pattern, which you still 
haven't articulated very well, but you can mess around with it and see 
what you can do.


Ie, if you KNOW (that is, your own app code knows, when creating the 
Solr request) that you only want the facet value for Cardiologist 
(including exact case), you can try facet.query=specialty:Cardiologist


Your app code would have to pull out the results special too, they won't 
be in the Solr response in same way ordinary facet.field is. It also 
requires your query value to match _exactly_ (case, punctuation, etc) 
the value in the index. Not cardiologist and Cardiologist.


I think Solr 3.1 has some regex based facet.filter abilities that might 
be useful, and help you get around the 'exact match' issues, but watch 
out for performance.
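
(Concretely, something along these lines -- note the count for the facet.query comes back 
under facet_queries rather than facet_fields, so your app has to read it from there:

q=cardiologist&defType=dismax&qf=specialties&facet=true&facet.query=specialties:Cardiologist
)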





On 6/21/2011 11:37 PM, Bill Bell wrote:

Doing it with q=specialities:Cardiologist or
q=Cardiologist&defType=dismax&qf=specialties
does not matter, the issue is how I see facets. I want the facets to only
show the one match,
and not all the multiValued fields in specialties that match...

Example,

Name|specialties
Bell|Cardiologist
Smith|Cardiologist,Family Doctor
Adams|Cardiologist,Family Doctor,Internist

When I facet.field=specialties I get:

Cardiologist: 3
Internist: 1
Family Doctor: 1


I only want it to return:

Cardiologist: 3

Because this matches exactly... Facet on the field that matches and only
return the number for that.

It can get more complicated. Here is another example:

q=cardiology&defType=dismax&qf=specialties


(Cardiology and cardiologist are stems)...

But I don't really know which value in Cardiologist match perfectly.

Again, I only want it to return:

Cardiologist: 3

If I searched on q=internist&defType=dismax&qf=specialties, I want the
result to be:


Internist: 1


Does this all make sense?







On 6/21/11 8:23 PM, Darren Govoni dar...@ontrenet.com wrote:


So are you saying that for all results for cardiologist,
you don't want facets not matching Cardiologist to be
returned as facets?

what happens when you make q=specialities:Cardiologist?
instead of just q=Cardiologist?

Seems that if you make the query on the field, then all
your results will necessarily qualify and you can discard
any additional facets you don't want (e.g. that don't
match the initial query term).

Maybe you can write what you see now, with what you
want to help clarify.

On 06/21/2011 09:47 PM, Bill Bell wrote:

I have a field: specialties that is multiValued.

It indicates the doctor's specialties: cardiologist, internist, etc.

When someone does a search: Cardiologist, I use

q=cardiologist&defType=dismax&qf=specialties&facet=true&facet.field=specialties

What I want to come out in the facet is the Cardiologist (since it
matches
exactly) and the number that matches: 700.
I don't want to see the other values that are not Cardiologist.

Now I see:

Cardiologist: 700
Internist: 45
Family Doctor: 20

This means that several Cardiologist's are also internists and family
doctors. When it matches exactly, I don't want to see Internists, Family
Doctors. How do I send a query to Solr with a condition.
facet.query=specialties:Cardiologist&facet.field=specialties

Then if the query returns something use it, otherwise use the field one?

Other ideas?









RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-21 Thread Jonathan Rochkind
Thanks, that's helpful. 

It still seems like current behavior does the wrong thing in _many_ cases (I 
know a lot of people get tripped up by it, sometimes on this list) -- but I 
understand your cases where it does the right thing, and where what I'm 
suggesting would be the wrong thing. 

 Ultimately the problem you had with  is the same problem people have 
 with stopwords, and comes down to the same thing: if you don't want some 
 chunk of text to be significant when searchng a field in your qf, have 
 your analyzer remove it 

Ah, but see the problem people have with stopwords is when they actually DID 
that. They didn't want a term to be 'significant' in one field, but they DID 
want it to be 'significant' in another field... but how this effects the 'mm' 
ends up being kind of counter-intuitive for some (but not other) 
setups/intentions.   It's counter-intuitive to me that adding a field to the 
'qf' set results in _fewer_ hits than the same 'qf' set without the new field 
-- although I understand your cases where you added the field to the 'qf' 
precisely in order to intentionally get that behavior, that's definitely not a 
universal case. 

And the fact that unpredictable changes to field analysis that aren't as simple 
as stopwords can lead to this same problem (as in this case where one field 
ignores punctuation and the other doesn't) -- it's definitely a trap waiting 
for some people. 

I wonder if it would be a good idea to have a parameter to (e)dismax that told 
it which of these two behaviors to use? The one where the 'term count' is based 
on the maximum number of terms from any field in the 'qf', and one where it's 
based on the minimum number of terms produced from any field in the qf?  I am 
still not sure how feasible THAT is, but it seems like a good idea to me. The 
current behavior is definitely a pitfall for many people.  

Or maybe a feature where you tell dismax, the number of tokens produced by 
field X, THAT's the one you should use for your 'term count' for mm, all the 
other fields are really just in there as sort of supplementary -- for boosting, 
or for bringing a few more results in; but NOT the case where you intentionally 
add a 'qf' with KeepWordsFilter in order to intentionally _reduce_ the result 
set . I think that's a pretty common use case too. 

Jonathan


Re: getting started

2011-06-16 Thread Jonathan Rochkind

On 6/16/2011 4:41 PM, Mari Masuda wrote:

One reservation I have is that eventually we would like to be able to type in Iraq and 
find records across all of the collections at once instead of having to search each collection 
separately.  Although I don't know anything about it at this stage, I did Google 
sharding after reading someone's recent post on this list and it sounds like that may 
be a potential answer to my question.


So this kind of stuff can be tricky, but with that eventual requirement 
I would NOT put these in separate cores. Sharding isn't (IMO, if someone 
disagrees, they will hopefully say so!) a good answer to searching 
across entirely different 'schemas', or avoiding frequent-commit issues 
-- sharding is really just for scaling/performance when your index gets 
very very large. (Which it doesn't sound like yours will be, but you can 
deal with that as a separate issue if it becomes so).


If you're going to want to search across all the collections, put them 
all in the same core.  Either in the exact same indexed fields, or using 
certain common indexed fields -- those common ones are the ones you'll 
be able to search across all collections on. It's okay if some 
collections have unique indexed fields too --- documents in the core 
that don't belong to that collection just won't have any terms in that 
indexed field that is only used by a certain collection, no problem. 
(Then you can distribute this single core into shards if you need to for 
performance reasons related to number of documents/size of index).


You're right to be thinking about the fact that very frequent commits 
can be performance issues in Solr. But separating in different cores is 
going to create more problems for yourself (if you want to be able to 
search across all collections), in an attempt to solve that one.  
(Among other things, not every Solr feature works in a 
distributed/sharded environment, it's just a more complicated and 
somewhat less mature setup for Solr).


The way I deal with the frequent-commit issue is by NOT doing frequent 
commits to my production Solr. Instead, I use Solr replication to have a 
'master' Solr index that I do commits to whenever I want, and a 'slave' 
Solr index that serves the production searches, and which only 
replicates from master periodically -- not too often to be 
too-frequent-commits.  That seems to be a somewhat common solution, if 
that use pattern works for you.


There are also some near real time features in more recent versions of 
Solr, that I'm not very familiar with. (not sure if any are included in 
the current latest release, or if they are all only still in the repo)  
My sense is that they too only work for certain use patterns, they 
aren't magic bullets for commit whatever you want as often as you want 
to Solr.  In general Solr isn't so great at very frequent major changes 
to the index.   Depending on exactly what sort of use pattern you are 
predicting/planning for your commits, maybe people can give you advice 
on how (or if) to do it.


But I personally don't think your idea of splitting your collections 
(that you'll eventually want to search across into a single search) 
into shards is a good solution to frequent-commit issues. You'd be 
complicating your setup and causing other problems for yourself, and not 
really even entirely addressing the too-frequent-commit issue with that 
setup.


Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-15 Thread Jonathan Rochkind
Okay, I figured this one out -- I'm participating in a thread with 
myself here, but for benefit of posterity, or if anyone's interested, 
it's kind of interesting.


It's actually a variation of the known issue with dismax, mm, and fields 
with varying stopwords. Actually a pretty tricky problem with dismax, 
which it's now clear goes way beyond just stopwords.


http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

So to understand, first familiarize yourself with that.

However, none of the fields involved here had any stopwords at all, so 
at first it wasn't obvious this was the problem. But having different 
tokenization and other analysis between fields can result in exactly the 
same problem, for certain queries.


One field in the dismax qf used an analyzer that stripped punctuation. 
(I'm actually not positive at this point _which_ analyzer in my chain 
was stripping punctuation, I'm using a bunch including some custom ones, 
but I was aware that punctuation was being stripped, this was intentional.)


So monkey's turns into monkey.  monkey: turns into monkey.  So 
far so good. But what happens if you have punctuation all by itself 
separated by whitespace?  Roosevelt & Churchill turns into 
['roosevelt', 'churchill'].  That ampersand in the middle was stripped 
out, essentially _just as if_ it were a stopword. Only two tokens result 
from that input.


You can see where this is going -- another field involved in the dismax 
qf did NOT strip out punctuation. So three tokens result from that 
input, ['Roosevelt', '&', 'Churchill'].


Now we have exactly the situation that gives ride the dismax stopwords 
mm-behaving-funny situation, it's exactly the same thing.


Now I've fixed this for punctuation just by making those fields strip 
out punctuation, by adding these analyzers to the bottom of those 
previously-not-stripping-punctuation field definitions:


<!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([\p{Punct}])" replacement="" replace="all"/>
<!-- if after stripping punc we have any 0-length tokens, make sure to
     eliminate them. We can use LengthFilter min=1 for that; we don't
     care about the max here, just a very large number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>


And things are working are how I expect again, at least for this 
punctuation issue. But there may be other edge cases where differences 
in analysis result in different number of tokens from different fields, 
which if they are both included in a dismax qf, will have bad effects on 
'mm'.


The lesson I think, is that the only absolute safe way to use dismax 
'mm', is when all fields in the 'qf' have exactly the same analysis.  
But obviously that's not very practical, it destroys much of the power 
of dismax. And some differences in analysis are certainly acceptable -- 
but it's rather tricky to figure out if your differences in analysis are 
going to be significant for this problem, under what input, and if so 
fix them. It is not an easy thing to do.  So dismax definitely has this 
gotcha potentially waiting for you, whenever mixing fields with 
different analysis in a 'qf'.



On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:

Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, that's ordinary text tokenized, and does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf= 



<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add in a second field to the qf, this 
second field is also ordinarily tokenized. We expect no _fewer_ than 
25 hits, adding another field into qf, right? And indeed it still 
results in exactly 25 hits (no additional hits from the additional qf 
field).


?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf= 



<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title2_t:roosevelt | 
title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | 
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, 
but add a second field in that uses KeywordTokenizerFactory.  We 
expect this not necessarily to ever match for a multi-word query, but 
we don't expect it to be fewer than 25 hits, the 25 hits from the 
first field in the qf should still be there, right? But it's not. What 
happened, why not?

Re: Multiple indexes

2011-06-15 Thread Jonathan Rochkind
Next, however, I predict you're going to ask how you do a 'join' or 
otherwise query accross both these cores at once though. You can't do 
that in Solr.


On 6/15/2011 1:00 PM, Frank Wesemann wrote:

You'll configure multiple cores:
http://wiki.apache.org/solr/CoreAdmin

Hi.

How to have multiple indexes in SOLR, with different fields and
different types of data?

Thank you very much!
Bye.





Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-15 Thread Jonathan Rochkind
 different fields, which if they
are both included in a dismax qf, will have bad effects on 'mm'.

The lesson I think, is that the only absolute safe way to use dismax 'mm',
is when all fields in the 'qf' have exactly the same analysis.  But
obviously that's not very practical, it destroys much of the power of
dismax. And some differences in analysis are certainly acceptable -- but
it's rather tricky to figure out if your differences in analysis are going
to be significant for this problem, under what input, and if so fix them. It
is not an easy thing to do.  So dismax definitely has this gotcha
potentially waiting for you, whenever mixing fields with different analysis
in a 'qf'.


On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:

Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, that's ordinary text tokenized, and does get hits:


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01)
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add in a second field to the qf, this second
field is also ordinarily tokenized. We expect no _fewer_ than 25 hits,
adding another field into qf, right? And indeed it still results in exactly
25 hits (no additional hits from the additional qf field).


?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, but
add a second field in that uses KeywordTokenizerFactory.  We expect this not
necessarily to ever match for a multi-word query, but we don't expect it to
be fewer than 25 hits, the 25 hits from the first field in the qf should
still be there, right? But it's not. What happened, why not?


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=


<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill |
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
title1_t:roosevelt)~0.01)~3) ()</str>



On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:

I'm aware that using a field tokenized with KeywordTokenizerFactory in
a dismax 'qf' is often going to result in 0 hits on that field -- (when a
whitespace-containing query is entered).  But I do it anyway, for cases
where a non-whitespace-containing query is entered, then it hits.  And in
those cases where it doesn't hit, I figure okay, well, the other fields in
qf will hit or not, that's good enough.

And usually that works. But it works _differently_ when my query contains
an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't,
and I can't figure out why.

basically,

defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The : is thrown out the text_field, but the mm still passes
somehow, right?

But, in the same index:

defType=dismax&mm=100%&q=one : two&qf=text_field
keyword_tokenized_text_field

gets 0 hits.  Somehow maybe the inclusion of the
keyword_tokenized_text_field in the qf causes dismax to calculate the mm
differently, decide there are three tokens in there and they all must match,
and the token : can never match because it's not in my index it's stripped
out... but somehow this isn't a problem unless I include a keyword-tokenized
  field in the qf?

This is really confusing, if anyone has any idea what I'm talking about
it and can shed any light on it, much appreciated.

The conclusion I am reaching is just NEVER include anything but a more or
less ordinarily tokenized field in a dismax qf. Sadly, it was useful for
certain use cases for me.

Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt |
title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:churchill
roosevelt~3^240.0 | text:churchil roosevelt~3^10.0 | title2_t:churchil
roosevelt~3^50.0 | author_unstem:churchill roosevelt~3^400.0 |
title_exactmatch:churchill roosevelt^500.0

ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-14 Thread Jonathan Rochkind
I'm aware that using a field tokenized with KeywordTokenizerFactory 
in a dismax 'qf' is often going to result in 0 hits on that field -- 
(when a whitespace-containing query is entered).  But I do it anyway, 
for cases where a non-whitespace-containing query is entered, then it 
hits.  And in those cases where it doesn't hit, I figure okay, well, the 
other fields in qf will hit or not, that's good enough.


And usually that works. But it works _differently_ when my query 
contains an ampersand (or any other punctuation), resulting in 0 hits when 
it shouldn't, and I can't figure out why.


basically,

defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The : is thrown out the text_field, but the mm still 
passes somehow, right?


But, in the same index:

defType=dismax&mm=100%&q=one : two&qf=text_field 
keyword_tokenized_text_field


gets 0 hits.  Somehow maybe the inclusion of the 
keyword_tokenized_text_field in the qf causes dismax to calculate the mm 
differently, decide there are three tokens in there and they all must 
match, and the token : can never match because it's not in my index 
it's stripped out... but somehow this isn't a problem unless I include a 
keyword-tokenized  field in the qf?


This is really confusing, if anyone has any idea what I'm talking about 
it and can shed any light on it, much appreciated.


The conclusion I am reaching is just NEVER include anything but a more 
or less ordinarily tokenized field in a dismax qf. Sadly, it was useful 
for certain use cases for me.


Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
DisjunctionMaxQuery((title2_unstem:churchill roosevelt~3^240.0 | 
text:churchil roosevelt~3^10.0 | title2_t:churchil roosevelt~3^50.0 
| author_unstem:churchill roosevelt~3^400.0 | 
title_exactmatch:churchill roosevelt^500.0 | title1_t:churchil 
roosevelt~3^60.0 | title1_unstem:churchill roosevelt~3^320.0 | 
author2_unstem:churchill roosevelt~3^240.0 | title3_unstem:churchill 
roosevelt~3^80.0 | subject_t:churchil roosevelt~3^10.0 | 
other_number_unstem:churchill roosevelt~3^40.0 | 
subject_unstem:churchill roosevelt~3^80.0 | title_series_t:churchil 
roosevelt~3^40.0 | title_series_unstem:churchill roosevelt~3^60.0 | 
text_unstem:churchill roosevelt~3^80.0)~0.01)

</str>
<str name="parsedquery_toString">
+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 
(isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) 
(title2_unstem:churchill roosevelt~3^240.0 | text:churchil 
roosevelt~3^10.0 | title2_t:churchil roosevelt~3^50.0 | 
author_unstem:churchill roosevelt~3^400.0 | title_exactmatch:churchill 
roosevelt^500.0 | title1_t:churchil roosevelt~3^60.0 | 
title1_unstem:churchill roosevelt~3^320.0 | author2_unstem:churchill 
roosevelt~3^240.0 | title3_unstem:churchill roosevelt~3^80.0 | 
subject_t:churchil roosevelt~3^10.0 | other_number_unstem:churchill 
roosevelt~3^40.0 | subject_unstem:churchill roosevelt~3^80.0 | 
title_series_t:churchil roosevelt~3^40.0 | 
title_series_unstem:churchill roosevelt~3^60.0 | 
text_unstem:churchill roosevelt~3^80.0)~0.01

</str>
<lst name="explain"/>
<str name="QParser">DisMaxQParser</str>
<null name="altquerystring"/>
<null name="boostfuncs"/>
<lst name="timing">
  <double name="time">6.0</double>
  <lst name="prepare">
    <double name="time">3.0</double>
    <lst name="org.apache.solr.handler.component.QueryComponent">
      <double name="time">2.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.FacetComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.HighlightComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.StatsComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.SpellCheckComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.DebugComponent">
      <double name="time">0.0</double>
    </lst>
  </lst>
</lst>
</lst>




Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-14 Thread Jonathan Rochkind

Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, an ordinarily tokenized text field, and it does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add a second field to the qf; this 
second field is also ordinarily tokenized. We expect no _fewer_ than 25 
hits from adding another field into qf, right? And indeed it still results 
in exactly 25 hits (no additional hits from the additional qf field).

?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | 
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, 
but add a second field that uses KeywordTokenizerFactory.  We expect 
this not necessarily to ever match for a multi-word query, but we don't 
expect it to produce fewer than 25 hits; the 25 hits from the first field 
in the qf should still be there, right? But it's not. What happened, why not?


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=


<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | 
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill | 
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | 
title1_t:roosevelt)~0.01)~3) ()</str>
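
Putting the two parsed queries side by side, the difference seems to be 
the extra clause dismax generates for the lone ':' chunk once isbn_t is 
in qf (these are just hand-drawn skeletons of the output above, not 
literal traces):

qf=title1_t title2_t  =>  +(( churchill-clause  roosevelt-clause )~2)                 two clauses; mm=100% needs 2
qf=title1_t isbn_t    =>  +(( churchill-clause  (isbn_t::)  roosevelt-clause )~3)     three clauses; mm=100% needs 3

Since (isbn_t::) can never match anything in the index, the second form 
can never satisfy mm=100%, which would explain the 0 hits.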




On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
I'm aware that using a field tokenized with KeywordTokenizerFactory in 
a dismax 'qf' is often going to result in 0 hits on that field -- 
(when a whitespace-containing query is entered).  But I do it anyway, 
for cases where a non-whitespace-containing query is entered, then it 
hits.  And in those cases where it doesn't hit, I figure okay, well, 
the other fields in qf will hit or not, that's good enough.


And usually that works. But it works _differently_ when my query 
contains an ampersand (or any other punctuation), resulting in 0 hits 
when it shouldn't, and I can't figure out why.


basically,

defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The ':' is thrown out by the text_field analysis, but the mm 
still passes somehow, right?


But, in the same index:

defType=dismax&mm=100%&q=one : two&qf=text_field 
keyword_tokenized_text_field


gets 0 hits.  Somehow maybe the inclusion of the 
keyword_tokenized_text_field in the qf causes dismax to calculate the 
mm differently, decide there are three tokens in there and they all 
must match, and the token ':' can never match because it's not in my 
index (it's stripped out)... but somehow this isn't a problem unless I 
include a keyword-tokenized field in the qf?


This is really confusing, if anyone has any idea what I'm talking 
about and can shed any light on it, much appreciated.


The conclusion I am reaching is just NEVER include anything but a more 
or less ordinarily tokenized field in a dismax qf. Sadly, it was 
useful for certain use cases for me.


Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
  <str name="rawquerystring">churchill : roosevelt</str>
  <str name="querystring">churchill : roosevelt</str>
  <str name="parsedquery">
    +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
    DisjunctionMaxQuery((isbn_t::)~0.01)
    DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
    DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 |
    text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 |
    author_unstem:"churchill roosevelt"~3^400.0 |
    title_exactmatch:"churchill roosevelt"^500.0 |
    title1_t:"churchil roosevelt"~3^60.0 |
    title1_unstem:"churchill roosevelt"~3^320.0 |
    author2_unstem:"churchill roosevelt"~3^240.0 |
    title3_unstem:"churchill roosevelt"~3^80.0 |
    subject_t:"churchil roosevelt"~3^10.0 |
    other_number_unstem:"churchill roosevelt"~3^40.0 |
    subject_unstem:"churchill roosevelt"~3^80.0 |
    title_series_t:"churchil roosevelt"~3^40.0 |
    title_series_unstem:"churchill roosevelt"~3^60.0 |
    text_unstem:"churchill roosevelt"~3^80.0)~0.01)
  </str>
  <str name="parsedquery_toString">
    +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
    (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3)
    (title2_unstem:"churchill roosevelt"~3^240.0 |
    text:"churchil roosevelt"~3^10.0

Re: How do I make sure the resulting documents contain the query terms?

2011-06-07 Thread Jonathan Rochkind
Um, normally that would never happen, because, well, like you say, the 
inverted index doesn't have docC for term K1, because doc C didn't 
include term K1.


If you search on q=K1, then how/why would docC ever be in your result 
set?  Are you seeing it in your result set? The question then would be 
_why_, what weird thing is going on to make that happen; that's not 
expected.


The result set _starts_ from only the documents that actually include 
the term.  Boosting/relevancy ranking only affects what order these 
documents appear in, but there's no reason documentC should be in the 
result set at all in your case of q=k1, where docC is not indexed under k1.
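
If some other feature (a boost query, synonym expansion, etc.) really is 
pulling extra documents in, one blunt way to hard-require the term is to 
repeat it in a filter query, something like this (the field name here is 
hypothetical):

q=k1&fq=content:k1

But for a plain q=k1 against a normal index, that shouldn't be necessary.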


On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:

Sorry for being unclear, and thank you for answering.
Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3),
where A, B, C are document identifiers and the k's in brackets are the terms
each contains.
So Solr's inverted index should be something like:

k0 -->  A | C
k1 -->  A | B
k2 -->  A | B | C
k3 -->  B | C

Now let q=k1; how do I make sure C doesn't appear as a result, since it
doesn't contain any occurrence of k1?


Re: Default query parser operator

2011-06-07 Thread Jonathan Rochkind

Nope, not possible.

I'm not even sure what it would mean semantically. If you had default 
operator OR ordinarily, but default operator AND just for field2, 
then what would happen if you entered:


field1:foo field2:bar field1:baz field2:bom

Where the heck would the ANDs and ORs go?  The operators are BETWEEN the 
clauses that specify fields; they don't belong to a field. In general, 
the operators are part of the query as a whole, not any specific field.


In fact, I'd be careful of your example query:
q=field1:foo bar field2:baz

I don't think that means what you think it means; I don't think the 
field1 applies to the bar in that case. I could be wrong, but you 
definitely want to check it.  You need field1:foo field1:bar, 
or set the default field for the query to field1, or use parens 
(although that will change the execution strategy and ranking): 
q=field1:(foo bar)


At any rate, even if there's a way to specify this so it makes sense, 
no, Solr/lucene doesn't support any such thing.
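
What you can do is change the default operator for the query as a whole, 
not per field -- either in schema.xml or per request. Roughly (these are 
from memory, so double-check the exact spelling for your Solr version):

<!-- schema.xml: applies to queries parsed by the standard query parser -->
<solrQueryParser defaultOperator="AND"/>

or, per request:

q=field1:word token field2:parser syntax&q.op=AND

Either way it applies to every clause in q, not just the field1 ones.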




On 6/7/2011 10:56 AM, Brian Lamb wrote:

I feel like this should be fairly easy to do, but I just don't see anywhere
in the documentation how to do this. Perhaps I am using the wrong search
parameters.

On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb
brian.l...@journalexperts.comwrote:


Hi all,

Is it possible to change the query parser operator for a specific field
without having to explicitly type it in the search field?

For example, I'd like to use:

http://localhost:8983/solr/search/?q=field1:word token field2:parser
syntax

instead of

http://localhost:8983/solr/search/?q=field1:word AND token field2:parser
syntax

But, I only want it to be applied to field1, not field2 and I want the
operator to always be AND unless the user explicitly types in OR.

Thanks,

Brian Lamb


