Cross-context-forward to solr-instance

2008-09-06 Thread Hachmann, Bjoern
Hi, 
 
Yesterday I tried Solr-1.3-RC2, and everything seems to work fine using the 
traditional single-core setup. But while troubleshooting the new multi-core 
feature, I realized for the first time that I have been using the deprecated 
(even in 1.2) class SolrServlet. This is a huge problem for us, as we run the 
Solr web-app in parallel with our main web-app in the same servlet container. 
With this approach we can internally forward update and select requests to the 
Solr instance currently in use. 
 
ServletContext ctx = getServletContext().getContext("solr1");
RequestDispatcher rd = ctx.getNamedDispatcher("SolrServer");
rd.forward(request, response);

As you can see, this approach only works for the servlet named 'SolrServer' 
which references the deprecated class. 

Using a path-based dispatcher (ctx.getRequestDispatcher) did not work, even 
though I configured the SolrRequestFilter in the solr-web.xml to apply to 
forwards (<dispatcher>FORWARD</dispatcher>), which the documentation 
discourages. Maybe this is because of the cross-context dispatch?

I have completely run out of ideas, apart from redesigning our whole setup. 
Any ideas are highly appreciated. 

Thanks in advance,
Björn


Re: Replacing FAST functionality at sesam.no

2008-09-06 Thread Mck
> but Mick Semb Wever will be taking over this job for the next two weeks.

back from holidays and taking over where Glenn-Erik left off. i'm very new
to Solr so please bear with me.

i'll run through our setup from scratch.

Our test list has 9 entries:
 "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl",
 "ijkl efgh", "efgh abcd", and "ijkl efgh abcd".

I'm using a trunk build of Solr, and using the example/solr for the solr
home.

Editing schema.xml to put these entries in as type="string" and using
defaultOperator="OR" gives the expected exact matching functionality,
provided queries are quoted, eg /solr/select/?q="abcd efgh ijkl"

So then i change type="string" to type="shingleString" along with

 <fieldType name="shingleString" class="solr.StrField"
            positionIncrementGap="100" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
             outputUnigramIfNoNgram="true" maxShingleSize="99" />
   </analyzer>
 </fieldType>

I never get any hits with quoted queries.
Without quotes i only get the unigrams.

I get the same outcomes using class="solr.TextField" on the fieldType and,
in the index analyzer, class="solr.KeywordTokenizerFactory" on the tokenizer.

In fact the ShingleFilter does nothing at all here: commenting the
filter line out leads to exactly the same behaviour.

What am i missing to get shingles actually matching the indexed entries?
  It seems to me that if this were solved it would work without having to
use quoted queries.

I have been using the analysis.jsp tool.
Everything looks good except that quotes are captured into the words and
shingles, eg

 term position | 1                | 2          | 3
 term text     | "abcd            | efgh       | ijkl"
               | "abcd efgh       | efgh ijkl" |
               | "abcd efgh ijkl" |            |

This would explain why quoted queries are not working - the
ShingleFilter produces tokens with the " character in them. But here i
would have at least expected a hit against efgh
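
A small stand-alone sketch may help when comparing against the analysis.jsp output. It is plain Java, not Lucene's ShingleFilter, and the class and method names (ShingleSketch, shingles) are made up for illustration. It splits on whitespace only, the way a whitespace tokenizer would, and emits the unigram plus every longer shingle starting at each position, so a query string that still carries its quote characters yields tokens such as "abcd and ijkl" with the quote attached.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; this is not how Lucene implements ShingleFilter.
final class ShingleSketch {

    // Splits on whitespace and, for each start position, emits the unigram
    // followed by every longer shingle of up to maxShingleSize tokens.
    static List<String> shingles(String text, int maxShingleSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int len = 1; len <= maxShingleSize && start + len <= tokens.length; len++) {
                if (len > 1) {
                    sb.append(' ');
                }
                sb.append(tokens[start + len - 1]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Unquoted input: tokens are clean.
        System.out.println(shingles("abcd efgh ijkl", 99));
        // Quoted input: the quote characters stay glued to the first and last token.
        System.out.println(shingles("\"abcd efgh ijkl\"", 99));
    }
}
```

In this model only the middle token, efgh, comes out of the quoted input without a quote character, which matches the expectation above that efgh at least should have produced a hit.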

~mck

-- 
He who joyfully marches to music in rank and file has already earned my
contempt. He has been given a large brain by mistake, since for him the
spinal cord would suffice. Albert Einstein 
| semb.wever.org | sesat.no | sesam.no |




UpdateRequestProcessorFactory / Chain etc

2008-09-06 Thread Brian Whitman
Trying to build a simple UpdateRequestProcessor that keeps a field  
(the time of original index) when overwriting a document.


1) Can I make an updateRequestProcessorChain apply only to a certain  
handler, or does putting the following in my solrconfig.xml:


 <updateRequestProcessorChain>
   <processor class="myspecial.KeepIndexedDateFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
   <processor class="solr.LogUpdateProcessorFactory" />
 </updateRequestProcessorChain>

just handle all document updates?

2) Does an UpdateRequestProcessor support inform?






Re: UpdateRequestProcessorFactory / Chain etc

2008-09-06 Thread Brian Whitman

Answered my own qs, I think:


> Trying to build a simple UpdateRequestProcessor that keeps a field
> (the time of original index) when overwriting a document.
>
>
> 1) Can I make an updateRequestProcessorChain apply only to a certain
> handler, or does putting the following in my solrconfig.xml:
>
>
> <updateRequestProcessorChain>
>   <processor class="myspecial.KeepIndexedDateFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
>   <processor class="solr.LogUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> just handle all document updates?




What you have to do is:

<requestHandler name="/update2" class="solr.XmlUpdateRequestHandler">
  <lst name="invariants">
    <str name="update.processor">KeepIndexed</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="KeepIndexed">
  <processor class="myspecial.KeepIndexedDateFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
</updateRequestProcessorChain>

And then calls to /update2 will go through the chain. Calls to /update  
will not.
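
For readers who want to see what such a chain is meant to do, here is a plain-Java model of it. It deliberately does not use Solr's UpdateRequestProcessor API: the Processor interface, the in-memory INDEX map, and the field name "indexed" are all assumptions made up for this sketch. The first link copies the original "indexed" value from the stored document (if there is one) and then passes the document on; the last link performs the overwrite.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; none of these types are Solr classes.
final class KeepIndexedDateSketch {

    // Stand-in for the index: document id -> stored fields.
    static final Map<String, Map<String, String>> INDEX = new HashMap<>();

    // One link in the chain (hypothetical interface).
    interface Processor {
        void processAdd(Map<String, String> doc);
    }

    // Copies the original "indexed" value from the stored document, if any,
    // then hands the document to the next processor in the chain.
    static Processor keepIndexedDate(Processor next) {
        return doc -> {
            Map<String, String> old = INDEX.get(doc.get("id"));
            if (old != null && old.containsKey("indexed")) {
                doc.put("indexed", old.get("indexed"));
            }
            next.processAdd(doc);
        };
    }

    // Last link: writes the document, overwriting any previous version.
    static Processor runUpdate() {
        return doc -> INDEX.put(doc.get("id"), new HashMap<>(doc));
    }

    // Convenience builder for a two-field document.
    static Map<String, String> doc(String id, String indexed) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("indexed", indexed);
        return d;
    }
}
```

Sending the same id through keepIndexedDate(runUpdate()) twice leaves the first "indexed" value in place, which is the behaviour the KeepIndexed chain is after; sending it through runUpdate() alone would overwrite it.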





> 2) Does an UpdateRequestProcessor support inform?



No, not that I can tell. And the factory won't get instantiated until  
the first time you use it.








Re: Faceting MoreLikeThisComponent results

2008-09-06 Thread Chris Hostetter

: When using the MoreLikeThisHandler with facets turned on, the facets show
: counts of things that are more like my original document. When I use the
: MoreLikeThisComponent, the facets show counts of things that match my
: original document (I'm querying by document ID), so there is only one
...
: How can I facet the results of the MoreLikeThisComponent?

I don't think you can at this point.  The good news is MoreLikeThisHandler 
isn't getting removed anytime soon.


What we need to do is provide more options on the components to dictate 
their behavior when deciding what to process and how to return it ... your 
example could be solved by either adding an option to MLTComponent telling 
it to overwrite the main result set; or by adding an option to 
FacetComponent specifying the name of a DocSet in the response to use in 
its intersections.

I think it would be good to do both.

(HighlightComponent should probably also have an option just like the one 
i described for FacetComponent)

Would you mind filing a feature request?


-Hoss



Re: scoring individual values in a multivalued field

2008-09-06 Thread Chris Hostetter
: I have a multivalued field that I would want to score individually for each
: value. Is there an easy way to do that?

Lucene-Java has a (somewhat new) feature called Payloads which allows 
for things like this, built around the idea that when indexing, any Token 
can contain an arbitrary data payload which is persisted along with the 
TermPosition info in the index -- At query time, different types of 
queries can use/abuse that payload any way they want.

Currently payload support in Solr is somewhat limited.  If you have a 
custom Analyzer or Tokenizer/TokenFilter that knows about Payloads, they 
will make it into the index, but you would need to write a custom 
Similarity and QParserPlugin to take advantage of it (there's already a 
BoostingTermQuery in Lucene that you can leverage)

Payloads is a really powerful feature, but the fact that it can be used in 
*so* many different ways is probably the biggest reason why Solr 
doesn't have any features yet to make payloads easier to use just via 
configuration.

At the moment, the simplest mechanisms for achieving something like what 
you are describing that i know of are:
  1) repetitive values.  Add a value twice to make it count (roughly) 
 twice as much. (eliminating lengthNorm and customizing your Similarity 
 is necessary to make it worth exactly twice as much)
  2) different fields.  Partition the spectrum of importance for your 
 values into N buckets, make a field for each bucket, put the value in 
 the bucket that makes the most sense, and at query time query for 
 each bucket with a different query time boost.
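
As a rough illustration of option 2, the query side could be assembled like this. It is only a sketch: the field names (tag_high, tag_mid, tag_low) and the boost values are invented for the example, the value is not escaped, and it assumes the clauses are combined with OR as the default operator.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper; field names and boosts are hypothetical.
final class BucketQuerySketch {

    // Builds one "field:value^boost" clause per bucket field, joined by
    // spaces. The value is used as-is, so callers must escape any query
    // syntax characters themselves.
    static String bucketQuery(String value, Map<String, Integer> boostByField) {
        StringBuilder q = new StringBuilder();
        for (Map.Entry<String, Integer> e : boostByField.entrySet()) {
            if (q.length() > 0) {
                q.append(' ');
            }
            q.append(e.getKey()).append(':').append(value)
             .append('^').append(e.getValue());
        }
        return q.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the buckets in insertion order.
        Map<String, Integer> boosts = new LinkedHashMap<>();
        boosts.put("tag_high", 4);
        boosts.put("tag_mid", 2);
        boosts.put("tag_low", 1);
        System.out.println(bucketQuery("solr", boosts));
    }
}
```

A document then scores according to which bucket field its copy of the value was indexed into, rather than according to a per-value index time boost.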

: 2) the value of normField is persisted as a byte in the index and the
: precision loss hurts.

for a field like what you are describing, you'll probably want to 
omitNorms completely just to make sure docs with lots of values aren't 
penalized.



-Hoss



Re: scoring individual values in a multivalued field

2008-09-06 Thread Chris Hostetter

: I ran into the same problem some time ago, couldn't find any relation to the
: boost values on the multivalued field and the search results. Does anybody

as the OP mentioned, the index time boost values for a field are per field 
*name* not per value ... they all get folded in together into the 
fieldNorm for that field name in that document.



-Hoss



Re: Synonyms and stemming revisited

2008-09-06 Thread Chris Hostetter

: I see two solutions:
: 
: Either put all possible endings in the synonym file - I do not really
: like this solution, as it would make the file very large, and it also
: is too easy to miss some specific ending. Or run the stemmer before
: the synonym filter, in which case the synonym definitions need to
: appear in their stemmed forms. Am I missing something, or does the

Based on my understanding of your description of your problem, i think i 
agree with you.

If i've given different advice in the past, I'm sure i had a good reason 
for it -- possibly due to some aspect of those problems that is subtly 
different from yours ... can you post links to the specific messages 
you're referring to?  It might help jog my memory.

: conversion of the synonym text file need to be done by hand at the
: moment? I suppose that it would not be too difficult to write some

A recently added feature is that when configuring SynonymFilterFactory 
you can give it the name of a TokenizerFactory to use when parsing the 
synonym file.  This could be used to stem words *if* you write a 
TokenizerFactory that calls out to your Stemmer.

(see SOLR-319 for the background on why you can only specify a Tokenizer 
and not a full fieldType to get the analysis chain from ... in a 
nutshell: 1. it would have been harder to implement; 2. the only use cases 
people could think of were tokenization based.)
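
The other route mentioned in the question, converting the synonym file so that its entries appear in stemmed form, can be sketched as a small preprocessing step. This is an assumption-laden illustration and not part of Solr: SynonymStemSketch, stemLine and toyStem are made-up names, the toy stemmer only strips one trailing "s" and stands in for whatever real stemmer the index uses, and multi-word synonym entries are not handled specially. It does assume the usual synonyms file layout of comma-separated terms with an optional "=>" mapping.

```java
import java.util.function.UnaryOperator;

// Illustrative preprocessing step; not a Solr class.
final class SynonymStemSketch {

    // Stems every comma-separated term on one synonyms-file line, keeping
    // an optional "=>" mapping arrow between the two sides.
    static String stemLine(String line, UnaryOperator<String> stemmer) {
        String[] sides = line.split("=>", -1);
        StringBuilder out = new StringBuilder();
        for (int s = 0; s < sides.length; s++) {
            if (s > 0) {
                out.append(" => ");
            }
            String[] terms = sides[s].split(",");
            for (int t = 0; t < terms.length; t++) {
                if (t > 0) {
                    out.append(", ");
                }
                out.append(stemmer.apply(terms[t].trim()));
            }
        }
        return out.toString();
    }

    // Toy stemmer for the demo only: strips a single trailing "s".
    static String toyStem(String word) {
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stemLine("cars, autos => vehicles", SynonymStemSketch::toyStem));
    }
}
```

Whichever stemmer is plugged in has to be the same one the field's analyzer applies, otherwise the stemmed synonym entries will not line up with the stemmed tokens.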


-Hoss



Re: UpdateRequestProcessorFactory / Chain etc

2008-09-06 Thread Chris Hostetter

: And then calls to /update2 will go through the chain. Calls to /update will
: not.

Correct.  Note also that there is a default attribute you can put 
on one updateRequestProcessorChain, and then XmlUpdateRequestHandler (etc...) 
will use that chain even if you don't tell it to use a particular one.  If 
you only define one chain, it becomes the default automatically.

:  2) Does a UpdateRequestProcessor support inform ?

: No, not that I can tell. And the factory won't get instantiated until the
: first time you use it.

inform, no ... but the factories should be getting instantiated during 
SolrCore init, what makes you think it's not until first use?  (that would 
be a bug if it's true, but a quick skim of SolrCore suggests it should be 
working correctly)




-Hoss



Re: handling multiple multiple resources with single requestHandler

2008-09-06 Thread Chris Hostetter

: Any ideas on how could we register single request handler for handling
: multiple (wildcarded) contexts/resource uri's ?
: 
: (something like) :
: 
: <requestHandler name="/app/*" class="solr.StandardRequestHandler">
: <requestHandler name="/app/*/query" class="solr.StandardRequestHandler">

One of the reasons wildcards aren't supported is because it creates 
ambiguity when dealing with dynamically created RequestHandlers.

Once upon a time we had the notion that a : (colon) could be used in the 
query path to denote that SolrDispatchFilter should stop there and treat 
everything up to the colon as the handler name, while everything after the 
colon should be put in the SolrQueryRequest for use by the RequestHandler, 
ie...
   /app/query?q=solr
   /app/query:yakko/foo/yak?q=solr
   /app/query:dot/bar/hoss?q=solr
...would all get processed by the /app/query handler which would have 
access to the "", "yakko/foo/yak", and "dot/bar/hoss" parts for each 
request.

That seems to have been removed from SolrDispatchFilter at some point, I'm 
not clear why but there are clearly remnants of it so maybe it was a 
mistake...

// unused feature ?
int idx = path.indexOf( ':' );
if( idx > 0 ) {
  // save the portion after the ':' for a 'handler' path parameter
  path = path.substring( 0, idx );
}

...i'm kind of tired right now, but if i'm reading that correctly it's 
flat out ignoring anything after the colon. (which seems like the worst of 
both worlds ... you can't have a : in your request handler name, but you 
can't have access to what comes after it if you put it in the URL)

I'm not sure what's going on there.  Maybe someone else understands.
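
For comparison, this is roughly what the colon convention described above would look like if the part after the colon were kept instead of dropped. It is a stand-alone sketch, not code from SolrDispatchFilter; the class and method names are invented.

```java
// Illustrative only; not taken from SolrDispatchFilter.
final class ColonPathSketch {

    // Returns {handlerName, remainder}. The remainder is "" when the path
    // contains no ':' (or starts with one, mirroring the idx > 0 check).
    static String[] splitAtColon(String path) {
        int idx = path.indexOf(':');
        if (idx > 0) {
            return new String[] { path.substring(0, idx), path.substring(idx + 1) };
        }
        return new String[] { path, "" };
    }

    public static void main(String[] args) {
        String[] parts = splitAtColon("/app/query:yakko/foo/yak");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

The remainder would then have to be exposed to the handler somehow, for example as a request parameter, which is the piece the snippet above appears to be missing.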

: The only way I can do it right now is by modifying SolrDispatchFilter, and
: manually adding request context trimming there (reducing the requested context
: to /app/), and registering handler for that context (which would later
: resolve other parts of it) - but if there is another way to do this -
: without changing the code, I would be more than happy to learn about it :)

if you're comfortable with ServletFilters enough to muck with 
SolrDispatchFilter, then wouldn't writing a new filter that you configure 
to sit in front of SolrDispatchFilter and take pieces out of the URL and 
add them as request params be just as easy to write (and a lot easier to 
maintain) ?
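
The path handling such a front filter would need is small. The sketch below shows only that part, as a pure function with invented names (PrefixRewriteSketch, rewrite); it does not touch the servlet API. In a real filter the result would presumably be fed into a request wrapper that reports the trimmed path and exposes the rest as a parameter before the request reaches SolrDispatchFilter, but that wiring is an assumption and is not shown.

```java
// Illustrative only; the servlet wiring around this is left out.
final class PrefixRewriteSketch {

    // If path equals handlerPrefix or lies below it, returns
    // {handlerPrefix, rest-of-path}; otherwise returns {path, ""} unchanged.
    static String[] rewrite(String path, String handlerPrefix) {
        if (path.equals(handlerPrefix)) {
            return new String[] { handlerPrefix, "" };
        }
        if (path.startsWith(handlerPrefix + "/")) {
            return new String[] { handlerPrefix, path.substring(handlerPrefix.length() + 1) };
        }
        return new String[] { path, "" };
    }

    public static void main(String[] args) {
        String[] parts = rewrite("/app/foo/query", "/app");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Requiring the "/" after the prefix keeps a path like /application from being mistaken for something under /app.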


-Hoss



Re: Questions on compound file format

2008-09-06 Thread Chris Hostetter

: 1. Using the compound file format drops the number of file descriptors
: needed. Any other benefits?

not that i know of.

: 2. Indexing may be slower. What about query performance?

If i remember correctly it's a little slower, but a little may be 
inconsequential.

: 3. Since Lucene 1.4, the compound file format became the default, however
: Solr default is not to use compound file format. Why this inconsistency?

SolrIndexConfig.java shows useCompoundFile = true as the default ... are 
you seeing something different getting used as the default somewhere?


-Hoss



Re: UpdateRequestProcessorFactory / Chain etc

2008-09-06 Thread Shalin Shekhar Mangar
On Sun, Sep 7, 2008 at 11:00 AM, Chris Hostetter wrote:


> inform, no ... but the factories should be getting instantiated during
> SolrCore init, what makes you think it's not until first use?  (that would
> be a bug if it's true, but a quick skim of SolrCore suggests it should be
> working correctly)


I think Brian is referring to the method
UpdateRequestProcessorFactory#getInstance(SolrQueryRequest,
SolrQueryResponse, UpdateRequestProcessor) which kinda limits you to create
it only on first request as an API. Noble pointed this out in SOLR-660 (but
after it was committed) --
https://issues.apache.org/jira/browse/SOLR-660?focusedCommentId=12617235#action_12617235

-- 
Regards,
Shalin Shekhar Mangar.


Re: Questions on compound file format

2008-09-06 Thread Shalin Shekhar Mangar
On Sun, Sep 7, 2008 at 11:21 AM, Chris Hostetter wrote:


> SolrIndexConfig.java shows useCompoundFile = true as the default ... are
> you seeing something different getting used as the default somewhere?


The example solrconfig.xml has useCompoundFile set to false in both the
indexDefaults and the mainIndex sections. Should we change that?

-- 
Regards,
Shalin Shekhar Mangar.