Re: camel-casing and dismax troubles

2009-05-13 Thread Geoffrey Young
On Wed, May 13, 2009 at 6:23 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young
 ge...@modperlcookbook.org wrote:
 hi all :)

 I'm having trouble with camel-cased query strings and the dismax handler.

 a user query

  LeAnn Rimes

 isn't matching the indexed term

  Leann Rimes

 This is the camel-case case that can't currently be handled by a
 single WordDelimiterFilter.

 If the indexed doc had LeAnn, then it would be indexed as
 le,ann/leann and hence queries of both forms "le ann" and
 "leann" would match.

 However since the indexed term is simply leann, a
 WordDelimiterFilter configured to split won't match (a search for
 LeAnn will be translated into a search for "le ann").

but catenateWords and/or catenateAll should handle splicing the tokens
back together, right?


 One way to work around this now is to do a copyField into another
 field that catenates split terms in the query analyzer instead of
 generating/splitting, and then search across both fields.
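
(a sketch of that workaround -- the field and type names here are
hypothetical, not from this thread:

  <!-- schema.xml: a second field whose query analyzer catenates -->
  <field name="search-en-cat" type="search-en-catenating"
         indexed="true" stored="false"/>
  <copyField source="search-en" dest="search-en-cat"/>

  <!-- query analyzer of search-en-catenating: catenate, don't split -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0" catenateWords="1" catenateAll="1"/>

then list both fields in dismax's qf, e.g. qf="search-en search-en-cat".)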

yeah, unfortunately, that's not an option for me :)


 BTW, your parsed query below shows you turned on both catenation and
 generation (or perhaps preserveOriginal) for split subwords in your
 query analyzer.  Unfortunately this configuration doesn't work due to
 the ambiguity of what it means to have multiple terms at the same
 position (this is the same problem for multi-word synonyms at query
 time).  The query shown below looks for "leann" or "le" followed by
 "ann" and hence an indexed term of "leann" won't match.
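
(spelled out in that notation: with catenation and generation both on, the
query analyzer emits

  position 1: le, leann
  position 2: ann

and the parser treats the overlap as the phrase-like "(leann le) ann" seen
in the debug output below, which a lone indexed term leann can't satisfy.)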

ugh.  ok, thanks for letting me know.

I'm not using the same catenate parameters at index time as at query
time, per the solr wiki docs.  but I've always wondered if that was a
good idea.  I'll see if matching them up helps at all.

thanks.  I'll let you know what I find.

--Geoff


camel-casing and dismax troubles

2009-05-12 Thread Geoffrey Young
hi all :)

I'm having trouble with camel-cased query strings and the dismax handler.

a user query

 LeAnn Rimes

isn't matching the indexed term

 Leann Rimes

even though both are lower-cased in the end.  furthermore, the
analysis tool shows a match.

the debug query looks like

 "parsedquery":"+((DisjunctionMaxQuery((search-en:\"(leann le)
ann\")) DisjunctionMaxQuery((search-en:rimes)))~2) ()",

I have a feeling it's due to how the broken up tokens are added back
into the token stream with PreserveOriginal, and some strange
interaction between that order and dismax, but I'm not entirely sure.

configs follow.  thoughts appreciated.

--Geoff

  <fieldType name="search-en" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="1"
              catenateNumbers="1"
              catenateAll="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="false"
              words="stopwords-en.txt"/>
    </analyzer>

    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="0"
              catenateNumbers="0"
              catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="false"
              words="stopwords-en.txt"/>
    </analyzer>
  </fieldType>


dismax and WordDelimiterFilterFactory+PreserveOriginal

2009-03-16 Thread Geoffrey Young
hi all :)

I have two filters combined with dismax on the query side:

  WordDelimiterFilterFactory { preserveOriginal=1,
generateNumberParts=1, catenateWords=0, generateWordParts=1,
catenateAll=0, catenateNumbers=0}

followed by the lowercase filter factory.  the analyzer shows the phrase

  gUYS and dOLLS

being tokenized as

  pos 1      pos 2   pos 3   pos 4      pos 5
  guys, g    uys     and     dolls, d   olls

and matching an index where everything is like you would expect
(lowercased, etc).

anyway, dismax is failing to get a match, even though the analyzer says
all is ok.  dismax reports the following:

  "rawquerystring":"gUYS and dOLLS",
  "querystring":"gUYS and dOLLS",

  "parsedquery":"+((DisjunctionMaxQuery((search:\"(guys g) uys\"))
DisjunctionMaxQuery((search:\"(dolls d) olls\")))~2) ()",

  "parsedquery_toString":"+(((search:\"(guys g) uys\") (search:\"(dolls
d) olls\"))~2) ()",

so it seems like PreserveOriginal is mucking with the token order in a
way that makes dismax very unhappy.

thoughts?

--Geoff


filtering on blank OR specific range

2008-11-19 Thread Geoffrey Young
hi all :)

I'm having difficulty filtering my documents when a field is either
blank or set to a specific value.  I would have thought this would work

  fq=-Type:[* TO *] OR Type:blue

which I would expect to find all documents where either Type is undefined
or Type is blue.  my actual result set is zero.

using a similar filter

  fq=-Type:[* TO *] OR OtherThing:cat

does what I would expect (documents with undefined type or documents
with cats), so it feels like solr is getting confused with the range
negation and ORing, but only when the field is the same.  adding various
parentheses makes no difference.
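
(one form that is sometimes suggested for pure-negative clauses, untested
here: anchor the negation with a match-all query so there is something to
subtract from, e.g.

  fq=Type:blue OR (*:* -Type:[* TO *])
)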

I know this is kind of nebulous sounding, but I was hoping someone would
look at this and go you're doing it wrong.  your filter should be...

the field is defined as

  <field name="Type" type="string" indexed="true" stored="true"
         multiValued="true"/>

if it matters.

tia

--Geoff


Re: filtering on blank OR specific range

2008-11-19 Thread Geoffrey Young


Lance Norskog wrote:
 Try:   Type:blue OR -Type:[* TO *] 
 
 You can't have a negative clause at the beginning. Yes, Lucene should barf
 about this.

I did try that, before and again now, and still no luck.

anything else?

--Geoff


Re: solr 1.3 snapshooter doesn't work, commit never ending

2008-10-15 Thread Geoffrey Young


sunnyfr wrote:
 I tried last evening before leaving and again this morning; the time
 elapsed was very long, as you can see above, and there was no
 snapshot and no error in the logs.

I'm actually having similar trouble.  I've enabled postCommit and
postOptimize hooks with an absolute path to snapshooter.

not only are the snapshots not created, I can't even see calls (or
errors) in catalina.out.

it's supposed to be this easy, right?

  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">/path/to/solr/bin/snapshooter</str>
    <bool name="wait">true</bool>
  </listener>

nothing else?  of course, calling it manually is just fine.

--Geoff


Re: using DataImportHandler instead of POST?

2008-10-03 Thread Geoffrey Young


Chris Hostetter wrote:
 : I chugg away at 1.5 million records in a single file, but solr never
 : commits.  specifically, it ignores my autocommit settings.  (I can
 : commit separately at the end, of course :)
 
 the way the autocommit settings work is something i always get confused by 
 -- the autocommit logic may not kick in until the <add> is 
 finished, regardless of how many docs are in it -- but i'm not certain 
 (and if i'm correct, i'm not sure if that's a bug or a feature)

ok, that makes sense.

fwiw, I tried to break the records into <add> chunks in the same file
but solr complained about multiple root entities.  I knew you couldn't
mix adds and deletes (rats ;) but I figured multiple <add> blocks would be
ok.  I guess not :)

 
 this may be a motivating reason to use DIH in your use case even though 
 you've already got it in the XmlUpdateRequestHandler format.

yeah, I'll check.  though I don't know what I'd do with trying to figure
out which records were committed and which weren't...

 
 : but I might be misunderstanding autocommit.  I have it set as the
 : default solrconfig.xml does, in the updateHandler section (mapped to
 : UpdateHandler2) but /update is mapped to XmlUpdateRequestHandler.
 : should I be shuffling some things around?
 
 due to some unfortunate naming decisions several years ago an update 
 Handler and a Request handler that does updates aren't the same thing 
 ... updateHandler (which should always be DirectUpdateHandler2) is the 
 low level internal code that is responsible for actually making the index 
 modifications -- XmlUpdateRequestHandler (or DataImportHandler) parses the 
 raw input and hands off to DirectUpdateHandler2 to make the changes.
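
(as a concrete sketch of that split -- the values here are illustrative
defaults, not from this thread -- the two live in different sections of
solrconfig.xml:

  <!-- low-level update handler: where autocommit is configured -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime> <!-- ms -->
    </autoCommit>
  </updateHandler>

  <!-- request handler: parses posted XML, hands off to the above -->
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler"/>
)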

ok, thanks.  I kind of inferred that from the wiki, but it was still
confusing, so thanks for the clarification.

--Geoff


Re: using DataImportHandler instead of POST?

2008-10-01 Thread Geoffrey Young


Geoffrey Young wrote:
 
 Chris Hostetter wrote:
 : I have a well-formed xml file, suitable for POSTting to solr.  that
 : works just fine.  it's very large, though, and using curl in production
 : is so very lame.  is there a very simple config that will let solr just
 : slurp up the file via the DataImportHandler?  solr already has

 You don't even need DIH for this, just enableRemoteStreaming and use the 
 stream.file param and you can load the file from local disk...

  http://wiki.apache.org/solr/ContentStream
 
 this is the solution I think I'm going to go with - it seems to work
 perfectly.

well, with one exception.

I chugg away at 1.5 million records in a single file, but solr never
commits.  specifically, it ignores my autocommit settings.  (I can
commit separately at the end, of course :)

but I might be misunderstanding autocommit.  I have it set as the
default solrconfig.xml does, in the updateHandler section (mapped to
UpdateHandler2) but /update is mapped to XmlUpdateRequestHandler.
should I be shuffling some things around?

thanks.

--Geoff


Re: using DataImportHandler instead of POST?

2008-09-29 Thread Geoffrey Young


Chris Hostetter wrote:
 : I have a well-formed xml file, suitable for POSTting to solr.  that
 : works just fine.  it's very large, though, and using curl in production
 : is so very lame.  is there a very simple config that will let solr just
 : slurp up the file via the DataImportHandler?  solr already has
 
 You don't even need DIH for this, just enableRemoteStreaming and use the 
 stream.file param and you can load the file from local disk...
 
   http://wiki.apache.org/solr/ContentStream

this is the solution I think I'm going to go with - it seems to work
perfectly.
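
(for reference, a minimal sketch of that setup -- the path and commit flag
are illustrative assumptions:

  <!-- solrconfig.xml, inside <requestDispatcher>: let solr read streams -->
  <requestParsers enableRemoteStreaming="true"/>

  # then a tiny request points solr at the file on its local disk
  http://localhost:8983/solr/update?stream.file=/path/to/data.xml&commit=true
)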

thanks (to both of you).

--Geoff


using DataImportHandler instead of POST?

2008-09-28 Thread Geoffrey Young
hi all :)

I'm sorry I need to ask this, but after reading and re-reading the wiki
I don't see a clear path...

I have a well-formed xml file, suitable for POSTting to solr.  that
works just fine.  it's very large, though, and using curl in production
is so very lame.  is there a very simple config that will let solr just
slurp up the file via the DataImportHandler?  solr already has
everything it needs in schema.xml, so I don't think this would be very
hard... if I fully understood the DataImportHandler :)

tia

--Geoff


Re: spellchecker problems (bugs)

2008-07-25 Thread Geoffrey Young



This issue has been fixed in the trunk. Can you please use the latest trunk
code and try?


current trunk looks good.

thanks!

--Geoff


Re: Multiple search components in one handler - ie spellchecker

2008-07-25 Thread Geoffrey Young



Andrew Nagy wrote:

Hello - I am attempting to add the spellCheck component in my
search requesthandler so when a user does a search, they get the
results and spelling corrections all in one query just like the way
the facets work.

I am having some trouble accomplishing this - can anyone point me to
documentation (other than
http://wiki.apache.org/solr/SpellCheckComponent) on how to do this or
an example solrconfig that would do this correctly?


http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200806.mbox/[EMAIL 
PROTECTED]

in general, just add the

  <arr name="last-components">
    <str>spellcheck</str>
  </arr>

bit to your existing handler after following the setup in the wiki docs.

you can ignore the part about the exceptions, as that has been fixed in 
trunk.


HTH

--Geoff


Re: Multiple search components in one handler - ie spellchecker

2008-07-25 Thread Geoffrey Young



Andrew Nagy wrote:

Thanks for getting back to me Geoff.  Although, that is pretty much
what I have.  Maybe if I show my solrconfig someone might be able to
point out what I have incorrect?  The problem is that nothing related
to the spelling options is shown in the results, just the normal
expected search results.


right.  the spellcheck component does not issue a separate query *after* 
running the spellcheck; it merely offers suggestions in parallel with 
your existing query.


the results are more like

  below are the results for $query.  did you mean $suggestions?

HTH

--Geoff




Re: spell-checker and faceting

2008-07-23 Thread Geoffrey Young



dudes dudes wrote:
Hi, 

I'm trying to couple spell-checking mechanism with faceting in one url statement.. I can get the spell check right, but the facet doesn't work when it's combined 
with spell-checker... 

http://localhost:8080/solr/spellCheckCompRH?q=smath&spellcheck.q=smath&spellcheck=true&spellcheck.build=true&select?q=smath&rows=0&facet=true&facet.limit=1&facet.field=firstname 


it corrects smath to Smith, but doesn't facet it.


I was able to get faceting working without issue.  it seems to me your 
query string is off - note the 'select?q=smath' in the middle of your 
query.  I'd try again with that part removed.


also note you only need to send spellcheck.build=true once, not on each request.
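
(with the stray fragment removed and the parameters joined up, the request
might look something like this -- handler name as in the original post:

  http://localhost:8080/solr/spellCheckCompRH?q=smath&spellcheck.q=smath&spellcheck=true&rows=0&facet=true&facet.limit=1&facet.field=firstname
)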

--Geoff


Re: spellchecker problems (bugs)

2008-07-23 Thread Geoffrey Young



Jonathan Lee wrote:

I don't see the patch attached to my original email either -- does solr-user
not allow attachments?

This is ugly, but here's the patch inline:


issue created in jira:

  https://issues.apache.org/jira/browse/SOLR-648

--Geoff


Re: spellchecker problems (bugs)

2008-07-22 Thread Geoffrey Young



Shalin Shekhar Mangar wrote:

The problems you described in the spellchecker are noted in
https://issues.apache.org/jira/browse/SOLR-622 -- I shall create an issue to
synchronize spellcheck.build so that the index is not corrupted.


I'd like to discuss this a little...

I'm not sure that I want to rebuild the spelling index each time the 
underlying data index changes - the process takes a very long time and my 
updates are frequent changes to non-spelling-related data.


what I'd really like is for a change to my index to not cause an 
exception.  IIRC the old way of using a spellchecker didn't work like 
this at all - I could completely rm data/index and leave data/spell in 
place, add new data, not issue cmd=build and the spelling parts still 
worked just fine (albeit with old data).


not to say that SOLR-622 isn't a good idea (it is) but I don't really 
think the entire solution is keeping the spellcheck index in sync.  do 
they need to be kept in sync for things not to implode on me?


--Geoff


Re: problems with SpellCheckComponent

2008-07-08 Thread Geoffrey Young



When I made:
http://localhost:8080/solr/spellCheckCompRH?q=*:*&spellcheck.q=ruck&spellcheck=true

I have this exception:

Estado HTTP 500 - null java.lang.NullPointerException at
org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:217)



I see this all the time - to the point where I wonder how stable the new 
component is.


I *think* I've traced it to

  o the presence of both q *and* spellcheck.q
  o and *any* restart of solr without re-issuing spellcheck.build=true

I haven't been using any form of spellchecker for long, but I'm 
reasonably sure that I didn't need to rebuild on every restart.  I also 
used to think it was changes to schema.xml (and not a simple restart) 
that caused the issue, but I've seen the exception with no changes. 
I've also seen the exception pop up without a restart when the server 
sits overnight (last query of the day ok, go to sleep, query again in 
the morning and *boom*)


but regardless of restart issues, I've never seen it happen with just 
the q or just the spellcheck.q fields in my query - it's always when 
they're both there.


--Geoff


Re: problems with SpellCheckComponent

2008-07-08 Thread Geoffrey Young



Shalin Shekhar Mangar wrote:

Hi Geoff,

I can't find anything in the code which would give this exception when both
q and spellcheck.q are specified. Though, this exception is certainly
possible when you restart solr. Anyway, I'll look into it more deeply.


great, thanks.



There are a few ways in which we can improve this component. For example a
lot of this trouble can go away if we can reload the spell index on startup
if it exists or build it if it does not exist (SOLR-593 would need to be
resolved for this). With SOLR-605 committed, we can now add an option to
re-build the index (built from Solr fields) on commits by adding a listener
using the API. There are a few issues with collation which are being handled
in SOLR-606.

I'll open new issues to track these items. Please bear with us since this is
a new component and may take a few iterations to stabilize. Thank you for
helping us find these issues :)


np - this is a great feature to have and it's going to save me some 
effort as we prepare for deployment, so it's worth taking the time to 
work out the bugs.


thanks for your effort.

--Geoff


Re: SpellCheckerRequestHandler qt parameter

2008-06-27 Thread Geoffrey Young


I had null pointer exceptions left and right while composing this 
email... then I added spellcheck.build=true to one and they went away. 
do you need to rebuild the spelling index every time you alter (certain 
parts) of solrconfig.xml?  it was very consistent as reported below, but 
after simply issuing a rebuild I can't reproduce the null pointer.


this seems to happen every time I stop and start solr.

  o q=term&spellcheck.q=term - ok

  o stop & start solr (we're using tomcat 6.0.16)

  o q=term&spellcheck.q=term - null pointer 
(SpellCheckComponent.getTokens(SpellCheckComponent.java:215))

  o q=$term - ok

  o q=term&spellcheck.q=term&spellcheck.build=true - ok

--Geoff


Re: SpellCheckerRequestHandler qt parameter

2008-06-26 Thread Geoffrey Young

Norberto Meijome wrote:
 Hi there,

 Short and sweet :
 Is SCRH intended to honour qt= ?


 longer...
 I'm testing the newest SCRH ( SOLR-572), using last night's nightly build.

 I have defined a 'dismax' request handler which searches across a number
of fields. When I use the SCRH in a query, and I pass the qt=dismax
parameter, it is ignored. Furthermore, the default field is shown as being
used when I add debugQuery=true.

 I could replace some of dismax's capabilities with a longer query string,
but some parameters such as mm don't seem to exist with the standard handler.

it seems like it ought to work as a component of your dismax handler.  this
works for me:

   <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
     <lst name="defaults">
       <str name="echoParams">none</str>
       <str name="indent">off</str>
       <str name="qf">search-en</str>
     </lst>
     <lst name="invariants">
       <str name="mm">100%</str>
       <str name="wt">json</str>
     </lst>
     <lst name="appends">
       <str name="fq">Type:Event</str>
     </lst>
     <arr name="last-components">
       <str>spellcheck</str>
     </arr>
   </requestHandler>

   <searchComponent name="spellcheck"
       class="org.apache.solr.handler.component.SpellCheckComponent">
   ... from docs ...
   </searchComponent>

well *almost* - it works most excellently with q=$term but when I add
spellcheck.q=$term things implode:

HTTP Status 500 - null java.lang.NullPointerException at
org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:215)
 at
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:183)
 at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156)
 at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:965) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) 
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
 at...

not being a java guy I need to use solr out of the box, and adding
spellcheck.q makes my multi-word terms checked at the phrase level
("mickey mouse") instead of at the word level (mickey, mouse), which is
the behavior I'm seeking.  the docs make it sound like I could write my own
SpellingQueryConverter, but... well, they also use both q and
spellcheck.q at the same time, so it shouldn't implode like that :)

anyway, HTH

--Geoff





Re: SpellCheckerRequestHandler qt parameter

2008-06-26 Thread Geoffrey Young



Grant Ingersoll wrote:


On Jun 26, 2008, at 5:25 PM, Geoffrey Young wrote:




well *almost* - it works most excellently with q=$term but when I add
spellcheck.q=$term things implode:

HTTP Status 500 - null java.lang.NullPointerException at
org
.apache
.solr
.handler
.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:215) at
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:183) 
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156) 
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125) 
at org.apache.solr.core.SolrCore.execute(SolrCore.java:965) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) 
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272) 
at...


not being a java guy I need to use solr out of the box, and adding
spellcheck.q makes my multi-word terms checked at the phrase level
("mickey mouse") instead of at the word level (mickey, mouse), which is
the behavior I'm seeking.  the docs make it sound like I could write
my own SpellingQueryConverter, but... well, they also use both q and
spellcheck.q at the same time, so it shouldn't implode like that :)



What's your searchComponent look like for the SpellCheckComponent, 
exactly?


sorry for the long post - first some trivial stuff...

I had null pointer exceptions left and right while composing this 
email... then I added spellcheck.build=true to one and they went away. 
do you need to rebuild the spelling index every time you alter (certain 
parts) of solrconfig.xml?  it was very consistent as reported below, but 
after simply issuing a rebuild I can't reproduce the null pointer.


my original problem was...

request to

/solr/select?q=celin+dion&qf=search-en&qt=Search::Model::JSON::Search::Scan&spellcheck=true&indent=on&echoParams=all

succeeds as

{
 "responseHeader":{
  "status":0,
  "QTime":9,
  "params":{
    "fq":"Type:Event OR Type:Attraction OR Type:Venue",
    "echoParams":"all",
    "indent":"on",
    "qf":"search-en",
    "defType":"dismax",
    "spellcheck":"true",
    "echoParams":"all",
    "indent":"on",
    "q":"celin dion",
    "qf":"search-en",
    "qt":"Search::Model::JSON::Search::Scan",
    "mm":"100%",
    "facet":"false",
    "start":"0",
    "wt":"json",
    "rows":"0"}},
 "response":{"numFound":59,"start":0,"docs":[]
 },
 "spellcheck":{
  "suggestions":[
    "celin",[
     "numFound",1,
     "startOffset",0,
     "endOffset",5,
     "suggestion",["celina"]]]}}

request to

/solr/select?q=celin+dion&qf=search-en&qt=Search::Model::JSON::Search::Scan&spellcheck=true&indent=on&echoParams=all&spellcheck.q=celin+dion

implodes with

null java.lang.NullPointerException at 
org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:215) 
at...


if it makes a difference, it's svn trunk from last night + SOLR-14 applied.

thanks for taking the time - I'm hoping this isn't now a wild goose chase :)

--Geoff

solrconfig.xml:

  <requestHandler name="Search::Model::JSON::Search::Scan"
      class="solr.DisMaxRequestHandler">

    <lst name="defaults">
      <str name="echoParams">none</str>
      <str name="indent">off</str>
      <str name="qf">search sAttractionName sVenueName</str>
    </lst>
    <lst name="invariants">
      <str name="mm">100%</str>
      <str name="wt">json</str>
      <int name="start">0</int>
      <int name="rows">0</int>
      <str name="facet">false</str>
    </lst>
    <lst name="appends">
      <str name="fq">Type:Event OR Type:Attraction OR Type:Venue</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

  <searchComponent name="spellcheck"
      class="org.apache.solr.handler.component.SpellCheckComponent">

    <lst name="defaults">
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck.extendedResults">false</str>
    </lst>
    <lst name="invariants">
      <str name="spellcheck.count">5</str>
    </lst>
    <str name="queryAnalyzerFieldType">spell</str>

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">defaultspell</str>
    </lst>
  </searchComponent>

  <queryConverter name="queryConverter"
      class="org.apache.solr.spelling.SpellingQueryConverter"/>


schema.xml:

<fieldType name="spell" class="solr.TextField"
    positionIncrementGap="100">

  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


Re: missing document count?

2008-06-18 Thread Geoffrey Young



Chris Hostetter wrote:

: not hard, but useful information to have handy without additional
: manipulations on my part.

: our pages are the results of multiple queries.  so, given a max number of
: records per page (or total), the rows asked of query2 is max - query1, of

in the common case, counting the number of docs in a result is just as 
easy as reading some attribute containing the count. 


I suppose :)  in my mind, one (potentially) requires just a read, while 
the other requires some further manipulations.  but I suppose most 
modern languages have optimizations for things like array size :)


It sounds like you 
have a more complicated case where what you really want is the count of 
how many docs there are in the entire response 


I don't know how complex it is to ask for documents in the response, but 
yes :)


(ie: multiple result 
sections) ... 


multiple results from multiple queries, not a single query.

but really, I wasn't planning on having anyone (solr or otherwise) 
solve my needs.  I just find it odd that I need to discern the number 
of returned results.


that count is admittedly a little more work but would also be 
completely useless to most clients if it was included in the response 


perhaps :)

(just as the number of fields in each doc, or the total number of strings 
in the response) ... there is a lot of metadata that *could* be included 
in the response, but we don't bother when the client can compute that 
metadata just as easily as the server -- among other things, it helps keep 
the response size smaller.


agreed - smaller is better.

as for "client as easily as the server", I assumed that solr was keeping 
track of the document count already, if only to see when the number of 
documents exceeds the rows parameter.  if so, all the people who care 
about the number of documents in the result (which, I'll assume, is more 
than those who care about total strings in the response ;) are all 
re-computing a known value.




This was actually one of the original guiding principles of Solr: support 
features that are faster/cheaper/easier/more-efficient on the central 
server than they would be on the clients (sorting, docset caching, 
faceting, etc...)


sure, I'll buy that.  but in my mind it was only exposing something solr 
already was calculating anyway.


regardless, thanks for taking the time :)

--Geoff


Re: searching only within allowed documents

2008-06-11 Thread Geoffrey Young




Solr allows you to specify filters in separate parameters that are
applied to the main query, but cached separately.

q=the user query&fq=folder:f13&fq=folder:f24


I've been wanting more explanation around this for a while, so maybe now 
is a good time to ask :)


the "cached separately" verbiage here is the same as in the wiki, but I 
don't really understand what it means.  more precisely, I'm wondering 
what the real performance, caching, etc. differences are between


  q=fielda:foo+fieldb:bar&mm=100%

and

  q=fielda:foo&fq=fieldb:bar

my situation is similar to the original poster's in that the set of 
documents matching fielda is very large and common (say theaters across 
the world) while fieldb would narrow it considerably (one filter by 
country, then one by zipcode, etc).
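
(for reference, "cached separately" refers to Solr's filterCache, where
each fq clause's document set is cached and reused independently of the
main query; a stock solrconfig.xml definition looks roughly like:

  <filterCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="256"/>
)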


thanks

--Geoff




adding expand=true to WordDelimiterFilter

2008-05-19 Thread Geoffrey Young

hi :)

I'm having an interesting problem with my data.  in general, I want the 
results of the WordDelimiterFilter for better matching, but there are 
times when it's just too aggressive.  for example


  boys2men => boys 2 men (good)
  p!nk => pnk (maybe)
  !!!  => (nothing - bad)

there's a special place for bands who name themselves just punctuation 
marks :)


anyway, one way around this is synonyms.  but if I do that then I need 
to run the synonym filter multiple times.  the first might expand


  !!!  => chk chk chk
  p!nk => pink

while the next would need to run after the WordDelimiterFilter for

  boys 2 men => boyz II men
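
(a sketch of what that two-pass chain might look like -- the synonym file
names here are hypothetical:

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- pass 1: rescue punctuation-only names before WDF eats them -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms-pre.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
    <!-- pass 2: synonyms over the split tokens -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms-post.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
)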

I'd really like to avoid multiple passes (and multiple synonym files) if 
at all possible, but that's the solution I'm faced with currently...


unless an 'expand' option were added to the WordDelimiterFilter, in 
which case I'd have


  p!nk => p!nk pnk

after it runs, so I could just apply the synonyms once.  or maybe 
there's another solution I'm missing.


would it be difficult (or desirable) to add an expand option?

--Geoff



Re: adding expand=true to WordDelimiterFilter

2008-05-19 Thread Geoffrey Young



Chris Hostetter wrote:
by expand=true it sounds like you mean you are looking for a way to 
preserve the original term without any characters removed.


yes, that's it.



This sounds like SOLR-14 ... you might want to take a look at it, and see 
if the patch is still useable, and if not see if you can bring it up to
date.


I'm working with a team that deploys this all for me, so I've asked 
them.  I'll report back.


thanks for pointing it out :)

--Geoff


Re: token concat filter?

2008-05-08 Thread Geoffrey Young



Otis Gospodnetic wrote:

Geoff,

Whether synonyms are applied at index time or query time is
controlled via schema.xml - it depends on where you put the synonym
factory, whether in the index-time or query-time section of a
fieldType.  Synonyms are read once on start, I believe.  It might be
good to have them read at index reader open time, as is done with
elevate component...


I'm looking a bit more into this now.

I don't think you need synonyms applied at both query and index time if 
you're using expand - one or the other ought to work properly.  in fact, 
I suspect I'm the last person to figure this out ;)


the question is, then, which is the more efficient place to apply them?

my first inclination is to apply them (and other similar expanding 
mechanisms) to just the index so that the expansion happens only once 
and is held in (an efficient) index as opposed to manipulating every query.


the SpellCheckerRequestHandler example on the wiki has the opposite 
configuration, expanding synonyms on (only) the query.


thoughts on which approach is the more efficient one?

--Geoff


Re: token concat filter?

2008-05-08 Thread Geoffrey Young



Otis Gospodnetic wrote:

There is actually a Wiki page explaining this pretty well... have you
seen it?


I guess not.  I've been reading the wiki, but the trouble with wikis 
always seems to be (for me) finding stuff.  can you point it out?




Index-time expansion means larger indices and inability to easily
change synonyms (e.g. you thought of a new synonym for fish and
want to add it to the already indexed docs).


yes, I've thought of the latter limitation.  due to other factors, I'm 
hoping to re-index all of our documents from scratch nightly, so that's 
not much of a concern.


--Geoff


Re: Sort results on a field not ordered

2008-05-02 Thread Geoffrey Young



Erik Hatcher wrote:
What field type is chapterTitle?   I'm betting it is an analyzed field 
with multiple values (tokens/terms) per document.  To successfully sort, 
you'll need to have a single value per document - using copyField can 
help with this to have both a searchable field and a sortable version.


does this apply to facet fields as well?  I noticed that if I set 
facet.sort=true the results are indeed sorted by count... until the 
counts are the same, after which they are in random order (instead of 
ascii alpha).


--Geoff


token concat filter?

2008-05-01 Thread Geoffrey Young

hi :)

I'm looking for a filter that will compress all tokens into a single 
token.  the WordDelimiterFilterFactory does it for tokens it finds 
itself, but not ones passed to it.


basically, I'm trying to match

  Radiohead

in the index with

 radio head

in the query.  if it were spelled RadioHead or Radio-head in the index 
I'd find it, but as it is I'm missing it... unless I could squish all 
the query terms into a single token.  or maybe there's another route I 
haven't thought about yet.
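
(one sketch that avoids writing a custom filter -- the field and type
names here are hypothetical -- is a copyField into a field that keeps the
whole query as one token and strips the whitespace:

  <fieldType name="squished" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory"
              pattern="\s+" replacement="" replace="all"/>
    </analyzer>
  </fieldType>

so both "radio head" and "Radiohead" come out as the single token
radiohead.)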



--Geoff



Re: token concat filter?

2008-05-01 Thread Geoffrey Young



Yonik Seeley wrote:

If there are only a few such cases, it might be better to use synonyms
to correct them.


unfortunately, there are too many to handle this way.


Off the top of my head there's no concatenating token filter, but it
wouldn't be hard to make one.


hmm, ok.  I'm not a java guy, so I'll try the PatternTokenizerFactory 
before trying to write my own.  thanks :)


speaking of synonyms... will changes to synonyms.txt (and the other 
files) take effect on each re-indexing, or does the solr server read it 
once on load then hold on to it until restart?


--Geoff


Re: token concat filter?

2008-05-01 Thread Geoffrey Young



Walter Underwood wrote:

I've been doing it with synonyms and I have several hundred of them.


I'm dealing mostly with proper names, so I expect more like 80k of them 
for our data :)




Concatenating bi-word groups is pretty useful for English. We have a
habit of gluing words together. "database" used to be two words.
Dictionaries still think it should be "web server".


:)

--Geoff


Re: token concat filter?

2008-05-01 Thread Geoffrey Young



Walter Underwood wrote:

I doubt it would be that many. I recommend tracking the searches and
the clicks, and working on queries with low clickthrough.


the trouble is I'm in a dynamic biz - last week's popular clicks are very 
different from this week's, so by the time I analyze last week's popular 
misses it's too late to add them.  but the non-space issue represents 
10% of my misses consistently over time.




Here are a few of mine from that sort of analysis:

ghost dog => ghost dog, ghostdog
ghost hunters => ghost hunters, ghosthunters
ghost rider => ghost rider, ghostrider
ghost world => ghost world, ghostworld
ghostbusters => ghostbusters, ghost busters

I don't see as many in personal names. Mostly, things like De Niro
and DiCaprio.


hannahmontana?

;)

--Geoff




Re: token concat filter?

2008-05-01 Thread Geoffrey Young



Otis Gospodnetic wrote:

Geoff,

Whether synonyms are applied at index time or query time is
controlled via schema.xml - it depends on where you put the synonym
factory, whether in the index-time or query-time section of a
fieldType.  Synonyms are read once on start, I believe.  It might be
good to have them read at index reader open time, as is done with
elevate component...


coolio, thanks.

--Geoff


Re: Got parseException when search keyword AND on a text field

2008-04-24 Thread Geoffrey Young



Otis Gospodnetic wrote:

Not documented in any one place.  The place to look is the query parsers, but 
things like AND OR NOT TO are the ones to look out for.


this seems like something solr ought to handle gracefully on the backend 
for me - if I need to write logic to make sure a malicious query for 
"AND NOT" all by itself (in all caps) doesn't make solr implode, then so 
does everyone else...


--Geoff


another spellchecker question

2008-04-23 Thread Geoffrey Young

hi :)

I've noticed that (with solr 1.2) the returned order (as well as the 
actual matched set) is affected by the number of matches you ask for:


  q=hanna&suggestionCount=1
    suggestions:["Yanna"]

  q=hanna&suggestionCount=2
    suggestions:["Manna",
      "Yanna"]

  q=hanna&suggestionCount=5
    suggestions:["Manna",
      "Nanna",
      "Sanna",
      "Vanna",
      "Shanna"]

note how the #1 result is completely missing from the top 5... or at 
least that's how I _used_ to think about the sets :)


unfortunately, extendedResults seems to be a 1.3-only option, so I can't 
see what's going on here.  but I guess I'm asking if this is expected 
behavior.


--Geoff


Re: another spellchecker question

2008-04-23 Thread Geoffrey Young



Shalin Shekhar Mangar wrote:

Hi Geoffrey,
Yes, this is a caveat in the lucene contrib spellchecker which Solr uses.

From the lucene spell checker javadocs:


 * As the Lucene similarity that is used to fetch the most relevant
 * n-grammed terms is not the same as the edit distance strategy used to
 * calculate the best matching spell-checked word from the hits that
 * Lucene found, one usually has to retrieve a couple of numSug's in
 * order to get the true best match.
 *
 * I.e. if numSug == 1, don't count on that suggestion being the best
 * one.  Thus, you should set this value to at least 5 for a good
 * suggestion.

Therefore what you're seeing is by design. Probably we should change the
default number of suggestions when querying lucene spellchecker to 5 and
give back the top result if the user asks for only one suggestion from solr.


great, thanks for all that - I'm still trying to figure out where all 
the relevant docs live.  you've been really helpful.


--Geoff


Re: config for very frequent solr updates

2008-04-18 Thread Geoffrey Young



Otis Gospodnetic wrote:

Geoff,

There was just another thread where the person said he was doing
updates every 2 minutes.  


ok, I see that now.  unfortunately, the data is sparse there :)


Like you said, with the way Solr warms
searchers, this could be too frequent for instances with large caches
and high autowarmCount.


ok, thanks.

I'll have a better sense of the size of my data soon, but I suspect it's 
nowhere near on the scale of most of the people here - maybe a million 
documents, tops.  right now I'm proof-of-concept'ing nearly all our data 
(but in a single language) and it's 500K documents with an index of 100M :)




You may be better off playing with the combination of larger older
index and a smaller index with updates kept in RAM (on the slave, of
course).


good info, thanks.

sorry for the basic questions.  and thanks for the (later) pointer to 
solr-303 - I found the distributed search docs from there and will keep 
that in mind as I move forward.


--Geoff




Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Geoffrey Young [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, April 17, 2008 8:28:09 AM
Subject: config for very frequent solr updates

hi all :)

I didn't see any documentation on this, so I was wondering what the 
experience here was with updating solr with a small but constant

trickle of daemon-style updates.

unfortunately, it's a business requirement that backend db updates
make it to search as the changes roll in (5 minutes is out of the
question). with the way solr handles new searchers, warming, etc, I
was wondering if anyone had experience with this kind of thing and
could share thoughts/general config stuff on it.

thanks.

--Geoff




config for very frequent solr updates

2008-04-17 Thread Geoffrey Young

hi all :)

I didn't see any documentation on this, so I was wondering what the 
experience here was with updating solr with a small but constant trickle 
of daemon-style updates.


unfortunately, it's a business requirement that backend db updates make 
it to search as the changes roll in (5 minutes is out of the question). 
 with the way solr handles new searchers, warming, etc, I was wondering 
if anyone had experience with this kind of thing and could share 
thoughts/general config stuff on it.


thanks.

--Geoff


Re: schema help

2008-03-12 Thread Geoffrey Young



Rachel McConnell wrote:

Our Solr use consists of several rather different data types, some of
which have one-to-many relationships with other types.  We don't need
to do any searching of quite the kind you describe, but I have an idea
about it, depending on what you need to do with the book data.  It is
rather hacky, but maybe you can improve it.


coolio, thanks :)

[snip]



If your 'authors' 'write' 'books' with great frequency, you'd need to
update a lot...


yeah, unfortunately that's the case :)

I was using the book analogy because I figured it was simple to explain, 
not necessarily because I was trying to be vague :)



Another possibility is to do two searches, with this kind of
structure, which sort of mimics an RDBMS:
* everything in Solr has a field, type (book, author, library, etc).
these can be filtered on a search by search basis
* books have a field, authorId, uniquely referencing the author
* your first search will be restricted to just authors, from which you
will extract the IDs.
* your second search will be restricted to just books, whose authorId
field is exactly one of the IDs from the first search
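
(as a sketch, with hypothetical field values, the two steps might look like:

  q=name:asimov&fq=type:author          -> author docs; collect their ids
  q=authorId:(17 42 99)&fq=type:book    -> the books for those authors
)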


I think this approach solves the mindset issues I was having - I didn't 
want to be left with a schema like this


  authorId
  bookID1
  bookID2
  ...

but since lucene allows for all kinds of slots to exist and be empty, it 
seems I can simplify that to


  authorId
  bookId

and use multiple queries to satisfy the display needs.  it's probably 
more a duh! moment for the majority, but lucene is sufficiently 
different from what I'm used to that it's taking me a bit of time :)




As you have noticed, Lucene is not an RDBMS.  Searching through all
the text of all the books is more the use it was designed around; of
course the analogy might not be THAT strong with your need!


I think the fulltext search capabilities will serve us well for some 
aspects of our search needs.  the stemming, language, and other filters 
will definitely be a help to just about everything we do.


speaking of language, this is my last question for now...

what's the idiomatic way to represent multiple languages?  left to my 
own devices I'd probably do something like


   name_en-us
   name_es-us
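
(a common sketch for this uses a dynamicField per language suffix -- the
type name here is hypothetical:

  <dynamicField name="name_*" type="text" indexed="true" stored="true"/>
)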

anyway, thanks so much for your help.

--Geoff


Re: schema help

2008-03-12 Thread Geoffrey Young



the trouble I'm having is one of dimension.  an author has many, many
 attributes (name, birthdate, biography in $language, etc).  as does
each book (title in $language, summary in $language, genre, etc).  as
does each library (name, address, directions in $language, etc).  so
an author with N books doesn't seem to scale very well in the flat 
representations I'm finding in all the lucene/solr docs and

examples... at least not in some way I can wrap my head around.

OG: I'm not sure why the number of attributes worries you.  Imagine
it as a wide RDBMS table, if it helps.  Indices with dozens of fields
are not uncommon.


it's not necessarily the number of fields, it's the Attribute1 .. 
AttributeN-style numbering that worries me.  but I think it's all 
starting to make sense now... if wanting to pull data in multiple 
queries was my holdup.



OG: You certainly can do that.  I'm not sure I understand where the
hard part is.  You seem to know what attributes each entity has.
Maybe you are confused by how to handle N different types of entities
in a single index? 


yes... or, more properly, how to relate them to eachother.

I understand that the schema can hold tons of attributes that are unused 
in different documents.  my question seems to be how to organize my data 
such that I can answer the question "how do I get a list of libraries 
with $book like $pattern" - where does the de-normalization typically 
occur?  if a document fully represents a book by an author in a 
library, such that the same book (with all its attributes) is in my 
index multiple times (once for each library), how do I drill down to 
showing just the directions to a specific library?



(I'm assuming a single index is what you currently
have in mind)


using different indices is what my lucene+compass counterparts are 
doing.  I couldn't find an example of that in the solr docs (unless the 
answer is running multiple, distinct instances at the same time)



eew :)  seriously, though, that's what we have now - all rdbms
driven. if solr could only conceptually handle the initial lookup
there wouldn't be much point.

OG: Well, there might or might not be, depending on how much data you
have, how flexible and fast your RDBMS-powered (full-text?) search,
and so on.  The Lucene/Solr for full-text search + RDBMS/BDB for
display data is a common combination.


the decision has been made to use lucene to replace all rdbms 
functionality for search


*cough*

:)



maybe I'm thinking about this all wrong (as is to be expected :), but
I just can't believe that nobody is using solr to represent data a
bit more complex than the examples out there.

OG: Oh, lots of people are, it's just that examples are simple, so
people new to Solr, Lucene, etc. have easier time learning.


:)

thanks for your help here.

--Geoff


schema help

2008-03-11 Thread Geoffrey Young

hi :)

I'm trying to work out a schema for our widgets.  more than just coming 
up with something, I'd like something idiomatic in solr terms.  any help 
is much appreciated.  here's a similar problem space to what I'm working 
with...


lets say we're talking books.  books are written by authors and held in 
libraries.  a sister company is using lucene+compass and they seem to 
have completely different collections (or whatever the technical term is :)


  authors
  books
  libraries

so that a search for authors hits only the authors dataset.

all of the solr examples I can find don't seem to address this kind of 
data disparity.  what is the standard and idiomatic approach for solr?


for my particular data I'd want to display something like this

  author
book in library
book in library

on the same result page, but using a completely flat, single schema 
doesn't seem to scale very well.


collective wisdom most welcome :)

--Geoff


Re: schema help

2008-03-11 Thread Geoffrey Young



Otis Gospodnetic wrote:

Geoff,

I'm not sure if I understood your problem correctly, but it sounds
like you want your search to be restricted to authors, but then you
want to list all of his/her books when displaying results. 


that's about right.  add that I may also want to search on libraries and 
show all the books (and authors) stored there.


in real life, it's not books or authors, of course, but the parallels 
are close enough :)  in fact, the library example is a good one for 
me... or at least a network of public libraries linked together.



The
easiest thing to do would be to create an index where each
row/Document has the author name, the book title, etc.  For each
author-matching Document you'd pull his/her books out of the result
set.  Yes, this means the author name would be denormalized in
RDBMS-speak.  


I think I can live with the denormalization - it seems lucene is flat 
and very different conceptually than a database :)


the trouble I'm having is one of dimension.  an author has many, many 
attributes (name, birthdate, biography in $language, etc).  as does each 
book (title in $language, summary in $language, genre, etc).  as does 
each library (name, address, directions in $language, etc).  so an 
author with N books doesn't seem to scale very well in the flat 
representations I'm finding in all the lucene/solr docs and examples... 
at least not in some way I can wrap my head around.


part of what seemed really appealing about lucene in general was that 
you could stuff all this (unindexed) information into a document and 
retrieve it all based on some search criteria.  but it's seeming very 
difficult for me to wrap my head around the data I need to represent.



Another option is not to index/store book titles, but
rather have only an author index to search against.  The book data
(mapped to author identities) would then be pulled from an external
source (e.g. RDBMS: select title from books where author_id in
(1,2,3)) at search results display time.


eew :)  seriously, though, that's what we have now - all rdbms driven. 
if solr could only conceptually handle the initial lookup there wouldn't 
be much point.


maybe I'm thinking about this all wrong (as is to be expected :), but I 
just can't believe that nobody is using solr to represent data a bit 
more complex than the examples out there.


thanks for the feedback.

--Geoff



Otis

-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Geoffrey Young [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, March 11, 2008 12:17:32 PM
Subject: schema help

hi :)

I'm trying to work out a schema for our widgets.  more than just
coming up with something, I'd like something idiomatic in solr terms.
any help is much appreciated.  here's a similar problem space to what
I'm working with...

lets say we're talking books.  books are written by authors and held
in libraries.  a sister company is using lucene+compass and they seem
to have completely different collections (or whatever the technical
term is :)

authors books libraries

so that a search for authors hits only the authors dataset.

all of the solr examples I can find don't seem to address this kind
of data disparity.  what is the standard and idiomatic approach for
solr?

for my particular data I'd want to display something like this

author book in library book in library

on the same result page, but using a completely flat, single schema 
doesn't seem to scale very well.


collective wisdom most welcome :)

--Geoff




multiple things in a document

2008-02-22 Thread Geoffrey Young

hi all :)

I'm just getting up to speed with solr (and lucene, for that matter) for 
a new project.  after reading through the available docs I'm not finding 
an answer to my most basic (newbie, certainly) question.  please feel 
free to just point me to the proper doc :)


this isn't my actual use case, but it's close enough for general 
understanding... say I want to store data on a collection of SKUs which 
(for the unfamiliar :) are a combination of item + location.  so we 
might have


  sku
id
name
item
location

  item
id
name

  location
id
name

all of the schema.xml examples seem to deal with just a flat thing 
perhaps with multiple entries of the same field.  what I'm after is how 
to represent this kind of relationship in the schema, such that I can 
limit my result set to, say, a sku or item, but if I search on sku I can 
discriminate between the sku name and the item name in my results.
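
(a flattened sketch of that shape -- field names hypothetical -- would
denormalize item and location onto each sku document:

  <field name="sku_id"        type="string" indexed="true" stored="true"/>
  <field name="sku_name"      type="text"   indexed="true" stored="true"/>
  <field name="item_id"       type="string" indexed="true" stored="true"/>
  <field name="item_name"     type="text"   indexed="true" stored="true"/>
  <field name="location_id"   type="string" indexed="true" stored="true"/>
  <field name="location_name" type="text"   indexed="true" stored="true"/>

searches could then be limited with fq=item_id:123, and sku_name vs
item_name stay distinguishable in the results.)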


from my reading on lucene this is pretty basic stuff, but I don't see 
how the solr layer approaches this at all.  again, doc pointers much 
appreciated.


thanks for listening :)

--Geoff