Re: Use case for the Shingle Filter

2017-03-05 Thread Ryan Josal
I thought new versions of solr didn't split on whitespace at the query
parser anymore, so this should work?

That being said, I think I remember it having a problem coming after a
synonym filter.  IIRC, if your input is "Foo Bar" and you have a synonym
"foo <=> baz" you would get foobaz bazbar instead of foobar and bazbar.  I
wrote a custom shingler to account for that.
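
For reference, the kind of index-time chain we're talking about looks roughly
like this in schema.xml (field type name and filter settings here are only
illustrative, not a recommended config):

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>

With a synonym expansion in front of it, the shingler sees the stacked tokens,
which is where those mis-paired shingles come from.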

Ryan

On Sun, Mar 5, 2017 at 02:48 Markus Jelsma 
wrote:

> Hello - we use it for text classification and online near-duplicate
> document detection/filtering. Using shingles means you want to consider
> order in the text. It is analogous to using bigrams and trigrams when doing
> language detection, you cannot distinguish between Danish and Norwegian
> solely on single characters.
>
> Markus
>
>
>
> -Original message-
> > From:Ryan Yacyshyn 
> > Sent: Sunday 5th March 2017 5:57
> > To: solr-user@lucene.apache.org
> > Subject: Use case for the Shingle Filter
> >
> > Hi everyone,
> >
> > I was thinking of using the Shingle Filter to help solve an issue I'm
> > facing. I can see this working in the analysis panel in the Solr admin,
> but
> > not when I make my queries.
> >
> > I find out it's because of the query parser splitting up the tokens on
> > white space before passing them along.
> >
> > This made me wonder what a practical use case can be, for using the
> shingle
> > filter?
> >
> > Any enlightenment on this would be much appreciated!
> >
> > Thanks,
> > Ryan
> >
>


Forking Solr

2015-10-16 Thread Ryan Josal
Hi guys, I'd like to get your tips on how to run a Solr fork at my
company.  I know Yonik has a "heliosearch" fork, and I'm sure many others
have a fork.  There have been times where I want to add features to an
existing core plugin, and subclassing isn't possible so I end up copying
the source code into my repo, then using some crazy reflection to get it to
work.  Sometimes there's a little bug in something and I have to do the
same thing.  Sometimes there's something I want to do deeper in core Solr
code that isn't pluggable and I end up doing an interesting workaround.
Sometimes I want to apply a patch from JIRA.  I also think forking solr
will make it easier for me to contribute patches back.  So here are my
questions:

*) how do I properly fork it outside of github to my own company's git
system?
*) how do I pull new changes?  I think I would expect to sync new changes
when there is a new public release.  What branches do I need to work
with/on?
*) how do I test my changes?  What part of the test suites do I run for
what changes?
*) how do I build a new version when I'm ready to go to prod?  This is
slightly more unclear to me now that it isn't just a war.

Thanks,
Ryan


Re: Forking Solr

2015-10-16 Thread Ryan Josal
Thanks for the feedback, forking lucene/solr is my last resort indeed.

1) It's not about creating fresh new plugins.  It's about modifying
existing ones or core solr code.
2) I want to submit the patch to modify core solr or lucene code, but I
also want to run it in prod before its accepted and released publicly.
Also I think this helps solidify the patch over time.
3) I have to do this all the time, and I agree it's better than forking,
but doing this repeatedly over time has diminishing returns because it
increases the cost of upgrading solr.  It also requires some ugly reflection
in most cases, and in others copying verbatim a pile of other classes.

I will send my questions to lucene-dev, thanks!
Ryan

On Friday, October 16, 2015, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Ryan,
>
> From a "solr-user" perspective :) I would advise against forking Solr. Some
> of our consulting business is "people who forked Solr, need to upgrade, and
> now have gotten themselves into hot water."
>
> I would try, in the following order
> 1. Creating a plugin (sounds like you can't do this)
> 2. Submitting a patch to Solr that makes it easier to create the plugin you
> need
> 3. Copy-pasting code to create a plugin. I once had to do this for a
> highlighter. It's ugly, but its better than forking.
> 4
> 999. Hiring Yonik :)
> 1000. Forking Solr
>
> 999 a prereq for 1000 :)
>
> Even very heavily customized versions of Solr sold by major vendors that
> staff committers are entirely plugin driven.
>
> Cheers
> -Doug
>
>
> > On Fri, Oct 16, 2015 at 3:30 PM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > I suspect these questions should go the Lucene Dev list instead. This
> > one is more for those who build on top of standard Solr.
> >
> > Regards,
> >Alex.
> >
> > 
> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > http://www.solr-start.com/
> >
> >
> > > On 16 October 2015 at 12:07, Ryan Josal <r...@josal.com>
> wrote:
> > > Hi guys, I'd like to get your tips on how to run a Solr fork at my
> > > company.  I know Yonik has a "heliosearch" fork, and I'm sure many
> others
> > > have a fork.  There have been times where I want to add features to an
> > > existing core plugin, and subclassing isn't possible so I end up
> copying
> > > the source code into my repo, then using some crazy reflection to get
> it
> > to
> > > work.  Sometimes there's a little bug in something and I have to do the
> > > same thing.  Sometimes there's something I want to do deeper in core
> Solr
> > > code that isn't pluggable and I end up doing an interesting workaround.
> > > Sometimes I want to apply a patch from JIRA.  I also think forking solr
> > > will make it easier for me to contribute patches back.  So here are my
> > > questions:
> > >
> > > *) how do I properly fork it outside of github to my own company's git
> > > system?
> > > *) how do I pull new changes?  I think I would expect to sync new
> changes
> > > when there is a new public release.  What branches do I need to work
> > > with/on?
> > > *) how do I test my changes?  What part of the test suites do I run for
> > > what changes?
> > > *) how do I build a new version when I'm ready to go to prod?  This is
> > > slightly more unclear to me now that it isn't just a war.
> > >
> > > Thanks,
> > > Ryan
> >
>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>


Re: Solr cross core join special condition

2015-10-07 Thread Ryan Josal
I developed a join transformer plugin that did that (although it didn't
flatten the results like that).  The one thing that was painful about it is
that the TextResponseWriter has references to both the IndexSchema and
SolrReturnFields objects for the primary core.  So when you add a
SolrDocument from another core it returned the wrong fields.  I worked
around that by transforming the SolrDocument to a NamedList.  Then when it
gets to processing the IndexableFields it uses the wrong IndexSchema; I
worked around that by transforming each field to a hard Java object
(through the IndexSchema and FieldType of the correct core).  I think it
would be great to patch TextResponseWriter with multi core writing
abilities, but there is one question, how can it tell which core a
SolrDocument or IndexableField is from?  Seems we'd have to add an
attribute for that.
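
To make that workaround concrete, here is a minimal sketch of the conversion
step (class and method names are mine, not the actual plugin; it assumes you
have a handle on the other core's IndexSchema):

import java.util.Collection;

import org.apache.lucene.index.IndexableField;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;

// Flattens a SolrDocument from the joined core into a NamedList of plain Java
// objects, resolved through that core's own schema, so the primary core's
// TextResponseWriter never sees the foreign IndexableFields.
public class CrossCoreDocConverter {
  public static NamedList<Object> toNamedList(SolrDocument doc, IndexSchema otherSchema) {
    NamedList<Object> out = new NamedList<>();
    for (String name : doc.getFieldNames()) {
      SchemaField sf = otherSchema.getFieldOrNull(name);
      Collection<Object> values = doc.getFieldValues(name);
      if (values == null) {
        continue;
      }
      for (Object v : values) {
        if (sf != null && v instanceof IndexableField) {
          // Use the owning core's FieldType to turn the raw field into a hard Java object.
          out.add(name, sf.getType().toObject((IndexableField) v));
        } else {
          out.add(name, v);
        }
      }
    }
    return out;
  }
}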

The other possibly simpler thing to do is execute the join at index time
with an update processor.

Ryan

On Tuesday, October 6, 2015, Mikhail Khludnev 
wrote:

> On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian  > wrote:
>
> > it
> > seems there is not any way to do that right now and it should be
> developed
> > somehow. Am I right?
> >
>
> yep
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> >
>


Bug in query elevation transformers SOLR-7953

2015-08-20 Thread Ryan Josal
Hey guys, I just logged this bug and I wanted to raise awareness.  If you
use the QueryElevationComponent, and ask for fl=[elevated], you'll get only
false if solr is using LazyDocuments.  This looks even stranger when you
request exclusive=true and you only get back elevated documents, and they
all say false.  I'm not sure how often LazyDocuments are used, but it's
probably not an uncommon issue.

Ryan


Re: rq breaks wildcard search?

2015-04-22 Thread Ryan Josal
Awesome thanks!  I was on 4.10.2
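
For anyone else with a custom RankQuery, the two methods Joel lists below
basically delegate to the wrapped main query.  A rough sketch against the
Lucene 4.x Query API (generic class name, not the actual ReRankQParserPlugin
code):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Weight;

public class DelegatingQuery extends Query {
  private final Query wrapped;

  public DelegatingQuery(Query wrapped) {
    this.wrapped = wrapped;
  }

  // Let the wrapped query (e.g. a WildcardQuery) rewrite itself into a
  // weight-capable form before createWeight is ever called.
  @Override
  public Query rewrite(IndexReader reader) throws IOException {
    Query q = wrapped.rewrite(reader);
    return (q == wrapped) ? this : new DelegatingQuery(q);
  }

  // Delegate term extraction to the wrapped query.
  @Override
  public void extractTerms(Set<Term> terms) {
    wrapped.extractTerms(terms);
  }

  @Override
  public Weight createWeight(IndexSearcher searcher) throws IOException {
    return wrapped.createWeight(searcher);
  }

  @Override
  public String toString(String field) {
    return "delegating(" + wrapped.toString(field) + ")";
  }
}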

Ryan

 On Apr 22, 2015, at 16:44, Joel Bernstein joels...@gmail.com wrote:
 
 For your own implementation you'll need to implement the following methods:
 
 public Query rewrite(IndexReader reader) throws IOException
 public void extractTerms(Set<Term> terms)
 
 You can review the 4.10.3 version of the ReRankQParserPlugin to see how it
 implements these methods.
 
 Joel Bernstein
 http://joelsolr.blogspot.com/
 
 On Wed, Apr 22, 2015 at 7:33 PM, Joel Bernstein joels...@gmail.com wrote:
 
 Just confirmed that wildcard queries work with Re-Ranking following
 SOLR-6323.
 
 Joel Bernstein
 http://joelsolr.blogspot.com/
 
 On Wed, Apr 22, 2015 at 7:26 PM, Joel Bernstein joels...@gmail.com
 wrote:
 
 This should be resolved in
 https://issues.apache.org/jira/browse/SOLR-6323.
 
 Solr 4.10.3
 
 Joel Bernstein
 http://joelsolr.blogspot.com/
 
 On Wed, Apr 15, 2015 at 6:23 PM, Ryan Josal rjo...@gmail.com wrote:
 
 Using edismax, supplying a rq= param, like {!rerank ...} is causing an
 UnsupportedOperationException because the Query doesn't implement
 createWeight.  This is for WildcardQuery in particular.  From some
 preliminary debugging it looks like without rq, somehow the qf Queries
 might turn into ConstantScore instead of WildcardQuery.  I don't think
 this
 is related to the RankQuery implementation as my own subclass has the
 same
 issue.  Anyway the effect is that all q's containing ? or * return http
 500
 because I always have rq on.  Can anyone confirm if this is a bug?  I
 will
 log it in Jira if so.
 
 Also, does anyone know how I can work around it?  Specifically, can I
 disable edismax from making WildcardQueries?
 
 Ryan
 


rq breaks wildcard search?

2015-04-15 Thread Ryan Josal
Using edismax, supplying a rq= param, like {!rerank ...} is causing an
UnsupportedOperationException because the Query doesn't implement
createWeight.  This is for WildcardQuery in particular.  From some
preliminary debugging it looks like without rq, somehow the qf Queries
might turn into ConstantScore instead of WildcardQuery.  I don't think this
is related to the RankQuery implementation as my own subclass has the same
issue.  Anyway the effect is that all q's containing ? or * return http 500
because I always have rq on.  Can anyone confirm if this is a bug?  I will
log it in Jira if so.

Also, does anyone know how I can work around it?  Specifically, can I
disable edismax from making WildcardQueries?

Ryan


Re: Group by score

2015-04-09 Thread Ryan Josal
You can use Result Grouping by a function using query(), but you'll need a
version of Lucene with this bug fixed:

https://issues.apache.org/jira/browse/SOLR-7046
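
A request along these lines is the idea (a sketch only; the inner
{!type=edismax v=$q} parser is just an example, adjust to taste):

group=true&group.main=true&group.func=query({!type=edismax v=$q})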

Ryan

On Thursday, April 9, 2015, Jens Mayer mjen...@yahoo.com.invalid wrote:

 Hey everybody,
 I have the following situation in my search application: I've been
 searching street sources. Executing a search returns several matches, and
 the first 10 matches are displayed. But in this situation a part of the
 results are nearly the same. For example, if I search for Berlin I receive
 every zip code as a separate record in my top 10, yet my application only
 shows the zip code if you explicitly search for it, e.g. 14089 Berlin. So if
 you only search for Berlin you receive Berlin ten times, without zip codes,
 as the result. I would like to avoid this. Looking at my results, it came to
 my attention that these records all have exactly the same score, so I would
 like to implement grouping by score. But it seems Solr doesn't know this
 field, which puzzles me, because inside a grouped set of records I can
 still sort by score.
 Does someone have an idea how I can resolve this?
 Greetings



Re: omitTermFreqAndPositions issue

2015-04-09 Thread Ryan Josal
Thanks a lot Erick, your suggestion on using similarity will work great; I
wasn't aware you could define similarity on a field by field basis until
now, and that solution works perfectly.
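
For anyone landing here later, the similarity class boils down to something
like this (class name is mine), plus a <similarity> element on the field type
in schema.xml, assuming the schema is set up to allow per-field similarities:

import org.apache.lucene.search.similarities.DefaultSimilarity;

// Flattens term frequency so repeated terms in a title don't inflate the score.
public class NoTfSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f;
  }
}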

Sorry what I said was a little misleading. I should have said I don't want
it to issue phrase queries to that specific field ever, because it has
positions turned off and so phrase queries cause exceptions.  Because I DO
want to run phrase queries on the title data, I just have another field
for that.

The problem is the one described here:
http://opensourceconnections.com/blog/2014/12/08/title-search-when-relevancy-is-only-skin-deep/

It still seems a bit off that you can't use an omitTermFreqAndPositions
field with edismax's qf; but I can't think of a situation that defining a
custom similarity wouldn't be the right solution.

Thanks again,
Ryan

On Wed, Apr 8, 2015 at 5:29 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Ryan:

 bq:  I don't want it to issue phrase queries to that field ever

 This is one of those requirements that you'd have to enforce at the
 app layer. Having Solr (or Lucene) enforce a rule like this for
 everyone would be terrible.

 So if you're turning off TF but also saying title is one of the
 primary components of score. Since TF in integral to calculating
 scores, I'm not quite sure what that means.

 You could write a custom similarity class that returns whatever you
 want (1.0 comes to mind) from the tf() method.

 Best,
 Erick

 On Wed, Apr 8, 2015 at 4:50 PM, Ryan Josal rjo...@gmail.com wrote:
  Thanks for your thought Shawn, I don't think fq will be helpful here.
 The
  field for which I want to turn TF off is title, which is actually one
 of
  the primary components of score, so I really need it in qf.  I just don't
  want the TF portion of the score for that field only.  I don't want it to
  issue phrase queries to that field ever, but if the user quotes
 something,
  it does, and I don't know how to make it stop.  To me it seems
 potentially
  more appropriate to send that to the pf fields, although I can think of a
  couple good reasons to put it against qf.  That's fine as long as it
  doesn't try to build a phrase query against a no TF no pos field.
 
  Ryan
 
  On Wednesday, April 8, 2015, Shawn Heisey apa...@elyograg.org wrote:
 
  On 4/8/2015 5:06 PM, Ryan Josal wrote:
   The error:
   IllegalStateException: field foo indexed without position data;
 cannot
   run PhraseQuery.
  
   It would actually be ok for us to index position data but there isn't
 an
   option for that without term frequencies.  No TF is important for us
 when
   it comes to searching product titles.
  
   I should say that only a small fraction of user queries contained
 quoted
   phrases that trigger this error, so it works much of the time, but
 we'd
   also like to continue supporting user quoted phrase queries.
  
   So how can I index a field without TF and use it in edismax qf?
 
  If you omit positions, you can't do phrase queries.  As far as I know,
  there is no option in Solr to omit only frequencies and not positions.
 
  I think there is a way that you can achieve what you want, though.  What
  you are looking for is filters.  The fq parameter (filter query) will
  restrict the result set to only entries that match the query, but will
  not affect the relevancy score *at all*.  Here is an example of a filter
  query that restricts the results to items that are in stock, assuming
  you have the appropriate schema:
 
  fq=inStock:true
 
  Queries specified in fq will default to the lucene query parser, but you
  can override that if you need to.  This query would be equivalent to the
  previous one, but it would be parsed using edismax:
 
  fq={!edismax}inStock:true
 
  Here's another example of a useful filter, using yet another query
 parser:
 
  fq={!terms f=userId}bob,alice,susan
 
  Remember, the reason I have suggested filters is that they do not
  influence score.
 
 
 
 https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter
 
  Thanks,
  Shawn
 
 



omitTermFreqAndPositions issue

2015-04-08 Thread Ryan Josal
Hey guys, it seems that omitTermFreqAndPositions is not very usable with
edismax, and I'm wondering if this is intended behavior, and how I can get
around the problem.

The setup:
define field foo with omitTermFreqAndPositions=true

The query:
q=ground coffee&qf=foo bar baz

The error:
IllegalStateException: field foo indexed without position data; cannot
run PhraseQuery.

It would actually be ok for us to index position data but there isn't an
option for that without term frequencies.  No TF is important for us when
it comes to searching product titles.

I should say that only a small fraction of user queries contained quoted
phrases that trigger this error, so it works much of the time, but we'd
also like to continue supporting user quoted phrase queries.

So how can I index a field without TF and use it in edismax qf?

Thanks for your help!
Ryan


Re: omitTermFreqAndPositions issue

2015-04-08 Thread Ryan Josal
Thanks for your thought Shawn, I don't think fq will be helpful here.  The
field for which I want to turn TF off is title, which is actually one of
the primary components of score, so I really need it in qf.  I just don't
want the TF portion of the score for that field only.  I don't want it to
issue phrase queries to that field ever, but if the user quotes something,
it does, and I don't know how to make it stop.  To me it seems potentially
more appropriate to send that to the pf fields, although I can think of a
couple good reasons to put it against qf.  That's fine as long as it
doesn't try to build a phrase query against a no TF no pos field.

Ryan

On Wednesday, April 8, 2015, Shawn Heisey apa...@elyograg.org wrote:

 On 4/8/2015 5:06 PM, Ryan Josal wrote:
  The error:
  IllegalStateException: field foo indexed without position data; cannot
  run PhraseQuery.
 
  It would actually be ok for us to index position data but there isn't an
  option for that without term frequencies.  No TF is important for us when
  it comes to searching product titles.
 
  I should say that only a small fraction of user queries contained quoted
  phrases that trigger this error, so it works much of the time, but we'd
  also like to continue supporting user quoted phrase queries.
 
  So how can I index a field without TF and use it in edismax qf?

 If you omit positions, you can't do phrase queries.  As far as I know,
 there is no option in Solr to omit only frequencies and not positions.

 I think there is a way that you can achieve what you want, though.  What
 you are looking for is filters.  The fq parameter (filter query) will
 restrict the result set to only entries that match the query, but will
 not affect the relevancy score *at all*.  Here is an example of a filter
 query that restricts the results to items that are in stock, assuming
 you have the appropriate schema:

 fq=inStock:true

 Queries specified in fq will default to the lucene query parser, but you
 can override that if you need to.  This query would be equivalent to the
 previous one, but it would be parsed using edismax:

 fq={!edismax}inStock:true

 Here's another example of a useful filter, using yet another query parser:

 fq={!terms f=userId}bob,alice,susan

 Remember, the reason I have suggested filters is that they do not
 influence score.


 https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter

 Thanks,
 Shawn




Re: sort on facet.index?

2015-04-02 Thread Ryan Josal
Sorting the result set or the facets?  For the facets there is
facet.sort=index (lexicographically) and facet.sort=count.  So maybe you
are asking if you can sort by index, but reversed?  I don't think this is
possible, and it's a good question.  I wanted to chime in on this one
because I wanted my own facet.sort=rank, but there is no nice pluggable way
to implement a new sort.  I'd love to be able to add a Comparator for a new
sort.  I ended up subclassing FacetComponent to sort of hack on the rank
sort implementation, but it isn't very pretty and I'm sure it's not as efficient
as it could be if FacetComponent was designed for more sorts.

Ryan

On Thursday, April 2, 2015, Derek Poh d...@globalsources.com wrote:

 Is sorting on facet index supported?

 I would like to sort on the below facet index

 <lst name="P_SupplierRanking">
   <int name="0">14</int>
   <int name="1">8</int>
   <int name="2">12</int>
   <int name="3">349</int>
   <int name="4">81</int>
   <int name="5">8</int>
   <int name="6">12</int>
 </lst>

 to

 <lst name="P_SupplierRanking">
   <int name="6">12</int>
   <int name="5">8</int>
   <int name="4">81</int>
   <int name="3">349</int>
   ...
   ...
   ...
 </lst>

 -Derek



Re: sort on facet.index?

2015-04-02 Thread Ryan Josal
Awesome, I didn't know this feature was going to add so much power!
Looking forward to using it.
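
Based on the post linked below, the reversed index order Derek asked about
should look something like this once 5.1 is out (untested sketch):

json.facet={ ranking : { type : terms, field : P_SupplierRanking, sort : "index desc" } }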

On Thursday, April 2, 2015, Yonik Seeley ysee...@gmail.com wrote:

 On Thu, Apr 2, 2015 at 10:25 AM, Ryan Josal rjo...@gmail.com wrote:
  Sorting the result set or the facets?  For the facets there is
  facet.sort=index (lexicographically) and facet.sort=count.  So maybe you
  are asking if you can sort by index, but reversed?  I don't think this is
  possible, and it's a good question.

 The new facet module that will be in Solr 5.1 supports sorting both
 directions on both count and index order (as well as by statistics /
 bucket aggregations).
 http://yonik.com/json-facet-api/

 -Yonik



DocTransformer#setContext

2015-03-20 Thread Ryan Josal
Hey guys, I wanted to ask if I'm using the DocTransformer API as intended.
There is a setContext( TransformerContext c ) method which is called by the
TextResponseWriter before it calls transform on any docs.  That context
object contains a DocIterator reference.  I want to use a DocTransformer to
add info from DynamoDB based on the uniquekeys of docs, so I figured this
would be the way to go to get all needed data from DDB in a batch before
transform.

Turns out if you call nextDoc on that iterator, that doc will not be
transformed because the iterator is not reset or regenerated in any way
before transformations start being called.  In some cases, if the Collector
collected extra docs, the DocSlice will have more docids to return even
after hasNext, and the code doesn't check that, so it will transform
those.  Then eventually it may throw an IndexOutOfBoundsException.  My gut
says this is not intended.  Why not give the DocList in the
TransformContext?

So in the example solrconfig, I think there is a suggestion to use
DocTransformers to get data from external DBs, but has anyone done this,
and how do they handle making a single/batch request instead of doing one
for every transform call?

Ryan


Re: rankquery usage bug?

2015-02-24 Thread Ryan Josal
Ticket filed, thanks!
https://issues.apache.org/jira/browse/SOLR-7152

On Fri, Feb 20, 2015 at 9:29 PM, Joel Bernstein joels...@gmail.com wrote:

 Ryan,

 This looks like a good jira ticket to me.

 Joel Bernstein
 Search Engineer at Heliosearch

 On Fri, Feb 20, 2015 at 6:40 PM, Ryan Josal rjo...@gmail.com wrote:

  Hey guys, I put a rq in defaults but I can't figure out how to override
 it
  with no rankquery.  Looks like one option might be checking for empty
  string before trying to use it in QueryComponent?  I can work around it
 in
  the prep method of an earlier searchcomponent for now.
 
  Ryan
 



Re: Solr synonyms logic

2015-02-21 Thread Ryan Josal
What you are describing is hyponymy.  Pastry is the hypernym.  You can
accomplish this by not using expansion, for example:
cannelloni => cannelloni, pastry

This has the result of adding pastry to the index.
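
Applied to your synonyms.txt, that would look like this (sketch), keeping the
filter on the index analyzer only as you already have it:

lasagne => lasagne, pastry
penne => penne, pastry
cannelloni => cannelloni, pastry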

Ryan

On Saturday, February 21, 2015, Mikhail Khludnev mkhlud...@griddynamics.com
wrote:

 Hello,

 usually debugQuery=true output explains a lot of such details.

 On Sat, Feb 21, 2015 at 10:52 AM, davym dmey...@luon.com
 wrote:

  Hi all,
 
  I'm querying a recipe database in Solr. By using synonyms, I'm trying to
  make my search a little smarter.
 
  What I'm trying to do here, is that a search for pastry returns all
  lasagne,
  penne  cannelloni recipes.
  However a search for lasagne should only return lasagne recipes.
 
  In my synonyms.txt, I have these lines:
  -
  lasagne,pastry
  penne,pastry
  cannelloni,pastry
  -
 
  Filter in my scheme.xml looks like this:
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
  ignoreCase="true" expand="true"
  tokenizerFactory="solr.WhitespaceTokenizerFactory" />
  Only in the index analyzer, not in the query.
 
  When using the Solr analysis tool, I can see that my index for lasagne
 has
  a
  synonym pastry and my query only queries lasagne. Same for penne and
  cannelloni, they both have the synonym pastry.
 
  Currently my Solr query for lasagne also returns all penne and cannelloni
  recipes. I cannot understand why this is the case.
 
  Can someone explain this behaviour to me please?
 
 
 
 
 
 
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Solr-synonyms-logic-tp4187827.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



rankquery usage bug?

2015-02-20 Thread Ryan Josal
Hey guys, I put a rq in defaults but I can't figure out how to override it
with no rankquery.  Looks like one option might be checking for empty
string before trying to use it in QueryComponent?  I can work around it in
the prep method of an earlier searchcomponent for now.
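
For the record, the workaround component amounts to something like this
(names are mine; a sketch against the 4.x SearchComponent API):

import java.io.IOException;

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class DropEmptyRqComponent extends SearchComponent {

  // Runs before QueryComponent; strips an empty rq so the value from
  // "defaults" can be overridden with rq= on the request.
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    String rq = params.get("rq");
    if (rq != null && rq.trim().isEmpty()) {
      ModifiableSolrParams modified = new ModifiableSolrParams(params);
      modified.remove("rq");
      rb.req.setParams(modified);
    }
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do at process time
  }

  @Override
  public String getDescription() {
    return "Removes an empty rq parameter before query processing";
  }

  @Override
  public String getSource() {
    return "";
  }
}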

Ryan


Custom facet.sort

2015-02-16 Thread Ryan Josal
Hey guys, I have a desire to order (field) facets by their order of
appearance in the search results.

When I first thought about it, I figured there would be some way to plug a
custom Comparator into FacetComponent and link it to facet.sort=rank or
something like that, but not only is there no real way to plug in a custom
sort (nor is subclassing the component feasible), the complexity is further
compounded by the fact faceting really only operates on the docset and so
scores aren't available.  If max(score) was an attribute of a facetcount
object this type of sort could be done.  sum(score) might also be
interesting for a weighted approach.  I can imagine performance concerns
with doing this though.  Operating on the doclist isn't enough because it's
only a slice of the results.  What if I reduce my scope to only needing the
top 2 facets in order?  It still seems to be just as complex because you
have to start from the first page and request an extra long docslice from
QueryComponent by hacking the start/rows params, and for all you know you
need to get to the last document to get all the facets.

So does anyone have any ideas of how to implement this?  Maybe it isn't
even through faceting.

Ryan


Re: An interesting approach to grouping

2015-01-27 Thread Ryan Josal
This is great, thanks Jim.  Your patch worked and the sorting solution
meets the goal, although group.limit seems like it could cut various
results out of the middle of the result set.  I will play around with it
and see if it proves helpful.  Can you let me know the Jira so I can keep
an eye on it?

Ryan

On Tuesday, January 27, 2015, Jim.Musil jim.mu...@target.com wrote:

 Interestingly, you can do something like this:

 group=true
 group.main=true
 group.func=rint(scale(query({!type=edismax v=$q}),0,20)) // puts into
 buckets
 group.limit=20 // gives you 20 from each bucket
 group.sort=category asc  // this will sort by category within each bucket,
 but this can be a function as well.



 Jim Musil



 On 1/27/15, 10:14 AM, Jim.Musil jim.mu...@target.com
 wrote:

 When using group.main=true, the results are not mixed as you expect:
 
 "If true, the result of the last field grouping command is used as the
 main result list in the response, using group.format=simple"
 
 https://wiki.apache.org/solr/FieldCollapsing
 
 
 Jim
 
 On 1/27/15, 9:22 AM, Ryan Josal rjo...@gmail.com
 wrote:
 
 Thanks a lot!  I'll try this out later this morning.  If group.func and
 group.field don't combine the way I think they might, I'll try to look
 for
 a way to put it all in group.func.
 
 On Tuesday, January 27, 2015, Jim.Musil jim.mu...@target.com wrote:
 
  I'm not sure the query you provided will do what you want, BUT I did
 find
  the bug in the code that is causing the NullPointerException.
 
  The variable context is supposed to be global, but when prepare() is
  called, it is only defined in the scope of that function.
 
  Here's the simple patch:
 
  Index: core/src/java/org/apache/solr/search/Grouping.java
  ===
  --- core/src/java/org/apache/solr/search/Grouping.java  (revision
 1653358)
  +++ core/src/java/org/apache/solr/search/Grouping.java  (working copy)
  @@ -926,7 +926,7 @@
*/
   @Override
   protected void prepare() throws IOException {
  -  Map context = ValueSource.newContext(searcher);
  +  context = ValueSource.newContext(searcher);
 groupBy.createWeight(context, searcher);
 actualGroupsToFind = getMax(offset, numGroups, maxDoc);
   }
 
 
  I'll search for a Jira issue and open if I can't find one.
 
  Jim Musil
 
 
 
  On 1/26/15, 6:34 PM, Ryan Josal r...@josal.com
 wrote:
 
  I have an index of products, and these products have a category
 which we
  can say for now is a good approximation of its location in the store.
 I'm
  investigating altering the ordering of the results so that the
 categories
  aren't interlaced as much... so that the results are a little bit more
  grouped by category, but not *totally* grouped by category.  It's
  interesting because it's an approach that sort of compares results to
  near-scored/ranked results.  One of the hoped outcomes of this would be that
  there would be somewhat fewer categories represented in the top
 results
  for
  a given query, although it is questionable if this is a good
 measurement
  to
  determine the effectiveness of the implementation.
  
  My first attempt was to
 
  group=true&group.main=true&group.field=category&group.func=rint(scale(query({!type=edismax v=$q}),0,20))
  
  Or some FunctionQuery like that, so that in order to become a member
 of a
  group, the doc would have to have the same category, and be dropped
 into
  the same score bucket (20 in this case).  This doesn't work out of the
  gate
  due to an NPE (solr 4.10.2) (although I'm not sure it would work
 anyway):
  
  java.lang.NullPointerException\n\tat
  org.apache.lucene.queries.function.valuesource.ScaleFloatFunction.getValues(ScaleFloatFunction.java:104)\n\tat
  org.apache.solr.search.DoubleParser$Function.getValues(ValueSourceParser.java:)\n\tat
  org.apache.lucene.search.grouping.function.FunctionFirstPassGroupingCollector.setNextReader(FunctionFirstPassGroupingCollector.java:82)\n\tat
  org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)\n\tat
  org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)\n\tat
  org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)\n\tat
  org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:451)\n\tat
  org.apache.solr.search.Grouping.execute(Grouping.java:368)\n\tat
  org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:459)\n\tat
  org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat
  
  
  Has anyone tried something like this before, and does anyone have any
  novel
  ideas for how to approach it, no matter how different?  How about a
  workaround for the group.func error here?  I'm very open-minded about
  where
  to go on this one.
  
  Thanks

Re: An interesting approach to grouping

2015-01-27 Thread Ryan Josal
Thanks a lot!  I'll try this out later this morning.  If group.func and
group.field don't combine the way I think they might, I'll try to look for
a way to put it all in group.func.

On Tuesday, January 27, 2015, Jim.Musil jim.mu...@target.com wrote:

 I'm not sure the query you provided will do what you want, BUT I did find
 the bug in the code that is causing the NullPointerException.

 The variable context is supposed to be global, but when prepare() is
 called, it is only defined in the scope of that function.

 Here's the simple patch:

 Index: core/src/java/org/apache/solr/search/Grouping.java
 ===
 --- core/src/java/org/apache/solr/search/Grouping.java  (revision 1653358)
 +++ core/src/java/org/apache/solr/search/Grouping.java  (working copy)
 @@ -926,7 +926,7 @@
   */
  @Override
  protected void prepare() throws IOException {
 -  Map context = ValueSource.newContext(searcher);
 +  context = ValueSource.newContext(searcher);
groupBy.createWeight(context, searcher);
actualGroupsToFind = getMax(offset, numGroups, maxDoc);
  }


 I'll search for a Jira issue and open if I can't find one.

 Jim Musil



 On 1/26/15, 6:34 PM, Ryan Josal r...@josal.com wrote:

 I have an index of products, and these products have a category which we
 can say for now is a good approximation of its location in the store.  I'm
 investigating altering the ordering of the results so that the categories
 aren't interlaced as much... so that the results are a little bit more
 grouped by category, but not *totally* grouped by category.  It's
 interesting because it's an approach that sort of compares results to
 near-scored/ranked results.  One of the hoped outcomes of this would be that
 there would be somewhat fewer categories represented in the top results
 for
 a given query, although it is questionable if this is a good measurement
 to
 determine the effectiveness of the implementation.
 
 My first attempt was to
 group=true&group.main=true&group.field=category&group.func=rint(scale(query({!type=edismax v=$q}),0,20))
 
 Or some FunctionQuery like that, so that in order to become a member of a
 group, the doc would have to have the same category, and be dropped into
 the same score bucket (20 in this case).  This doesn't work out of the
 gate
 due to an NPE (solr 4.10.2) (although I'm not sure it would work anyway):
 
 java.lang.NullPointerException\n\tat
 org.apache.lucene.queries.function.valuesource.ScaleFloatFunction.getValues(ScaleFloatFunction.java:104)\n\tat
 org.apache.solr.search.DoubleParser$Function.getValues(ValueSourceParser.java:)\n\tat
 org.apache.lucene.search.grouping.function.FunctionFirstPassGroupingCollector.setNextReader(FunctionFirstPassGroupingCollector.java:82)\n\tat
 org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)\n\tat
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)\n\tat
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)\n\tat
 org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:451)\n\tat
 org.apache.solr.search.Grouping.execute(Grouping.java:368)\n\tat
 org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:459)\n\tat
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat
 
 
 Has anyone tried something like this before, and does anyone have any
 novel
 ideas for how to approach it, no matter how different?  How about a
 workaround for the group.func error here?  I'm very open-minded about
 where
 to go on this one.
 
 Thanks,
 Ryan




An interesting approach to grouping

2015-01-26 Thread Ryan Josal
I have an index of products, and these products have a category which we
can say for now is a good approximation of its location in the store.  I'm
investigating altering the ordering of the results so that the categories
aren't interlaced as much... so that the results are a little bit more
grouped by category, but not *totally* grouped by category.  It's
interesting because it's an approach that sort of compares results to
near-scored/ranked results.  One of the hoped outcomes of this would be that
there would be somewhat fewer categories represented in the top results for
a given query, although it is questionable if this is a good measurement to
determine the effectiveness of the implementation.

My first attempt was to
group=true&group.main=true&group.field=category&group.func=rint(scale(query({!type=edismax v=$q}),0,20))

Or some FunctionQuery like that, so that in order to become a member of a
group, the doc would have to have the same category, and be dropped into
the same score bucket (20 in this case).  This doesn't work out of the gate
due to an NPE (solr 4.10.2) (although I'm not sure it would work anyway):

java.lang.NullPointerException\n\tat
org.apache.lucene.queries.function.valuesource.ScaleFloatFunction.getValues(ScaleFloatFunction.java:104)\n\tat
org.apache.solr.search.DoubleParser$Function.getValues(ValueSourceParser.java:)\n\tat
org.apache.lucene.search.grouping.function.FunctionFirstPassGroupingCollector.setNextReader(FunctionFirstPassGroupingCollector.java:82)\n\tat
org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)\n\tat
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)\n\tat
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)\n\tat
org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:451)\n\tat
org.apache.solr.search.Grouping.execute(Grouping.java:368)\n\tat
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:459)\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat


Has anyone tried something like this before, and does anyone have any novel
ideas for how to approach it, no matter how different?  How about a
workaround for the group.func error here?  I'm very open-minded about where
to go on this one.

Thanks,
Ryan


Re: Dynamically loaded core.properties file

2014-08-21 Thread Ryan Josal
Thanks Erick, I tested it and it does work, and it provides a solution to my
problem!  So property expansion does work in core.properties; I did not
know that, and I got the impression from Chris' comment that it would
open up a can of worms when it comes to persisting core.properties.  I
guess while the can's open, I'll eat up.
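
Concretely, the setup that works looks something like this (file and core
names are illustrative):

# core.properties for the core
name=products
properties=${solr.env}.properties

# Solr started with -Dsolr.env=dev, so dev.properties (resolved relative to
# the core directory) supplies the environment-specific properties.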


Just for fun I tried property expansion in my referenced subproperties 
file and it didn't work, which is fine for me.


Ryan

On 08/20/2014 04:11 PM, Erick Erickson wrote:

OK, not quite sure if this would work, but

In each core.properties file, put in a line similar to what Chris suggested:
properties=${env}/custom.properties

You might be able to now define your sys var like
-Drelative_or_absolute_path_to_dev_custom.properties file.
or
-Drelative_or_absolute_path_to_prod_custom.properties file.
on Solr startup. Then in the custom.properties file you have whatever
you need to define to make the prod/dev distinction you need.

WARNING: I'm not entirely sure that relative pathing works here, which
just means I haven't tried it.

Best,
Erick

On Wed, Aug 20, 2014 at 3:11 PM, Ryan Josal ry...@pointinside.com wrote:

Thanks Erick, that mirrors my thoughts exactly.  If core.properties had
property expansion it would work for this, but I agree with not supporting
that for the complexities it introduces, and I'm not sure it's the right way
to solve it anyway.  So, it doesn't really handle my problem.

I think because the properties file I want to load is not actually related
to any core, it makes it easier to solve.  So if solr.xml is no longer
rewritten then it seems like a global properties file could safely be
specified there using property expansion.  Or maybe there is some way to
write some code that could get executed before schema and solrconfig are
parsed, although I'm not sure how that would work given how you need
solrconfig to load the libraries and define plugins.

Ryan


On 08/20/2014 01:07 PM, Erick Erickson wrote:

Hmmm, I was going to make a code change to do this, but Chris
Hostetter saved me from the madness that ensues. Here's his comment on
the JIRA that I did open (but then closed), does this handle your
problem?

I don't think we want to make the name of core.properties be variable
... that way leads to madness and confusion.

the request on the user list was about being able to dynamically load
a property file with diff values between dev & production like you
could do in the old style solr.xml – that doesn't mean core.properties
needs to have a configurable name, it just means there needs to be a
configurable way to load properties.

we already have a properties option which can be specified in
core.properties to point to an additional external file that should
also be loaded ... if variable substitution was in play when parsing
core.properties then you could have something like
properties=custom.${env}.properties in core.properties ... but
introducing variable substitution into the core.properties (which solr
both reads & writes based on CoreAdmin calls) brings back the host of
complexities involved when we had persistence of solr.xml as a
feature, with the questions about persisting the original values with
variables in them, vs the values after evaluating variables.

Best,
Erick

On Wed, Aug 20, 2014 at 11:36 AM, Ryan Josal ry...@pointinside.com
wrote:

Hi all, I have a question about dynamically loading a core properties
file
with the new core discovery method of defining cores.  The concept is
that I
can have a dev.properties file and a prod.properties file, and specify
which
one to load with -Dsolr.env=dev.  This way I can have one file which
specifies a bunch of runtime properties like external servers a plugin
might
use, etc.

Previously I was able to do this in solr.xml because it can do system
property substitution when defining which properties file to use for a
core.

Now I'm not sure how to do this with core discovery, since the core is
discovered based on this file, and now the file needs to contain things
that
are specific to that core, like name, which previously were defined in
the
xml definition.

Is there a way I can plugin some code that gets run before any schema or
solrconfigs are parsed?  That way I could write a property loader that
adds
properties from ${solr.env}.properties to the JVM system properties.

Thanks!
Ryan






Dynamically loaded core.properties file

2014-08-20 Thread Ryan Josal
Hi all, I have a question about dynamically loading a core properties 
file with the new core discovery method of defining cores.  The concept 
is that I can have a dev.properties file and a prod.properties file, and 
specify which one to load with -Dsolr.env=dev.  This way I can have one 
file which specifies a bunch of runtime properties like external servers 
a plugin might use, etc.


Previously I was able to do this in solr.xml because it can do system 
property substitution when defining which properties file to use for a core.


Now I'm not sure how to do this with core discovery, since the core is 
discovered based on this file, and now the file needs to contain things 
that are specific to that core, like name, which previously were defined 
in the xml definition.


Is there a way I can plugin some code that gets run before any schema or 
solrconfigs are parsed?  That way I could write a property loader that 
adds properties from ${solr.env}.properties to the JVM system properties.


Thanks!
Ryan


Re: Dynamically loaded core.properties file

2014-08-20 Thread Ryan Josal
Thanks Erick, that mirrors my thoughts exactly.  If core.properties had 
property expansion it would work for this, but I agree with not 
supporting that for the complexities it introduces, and I'm not sure 
it's the right way to solve it anyway.  So, it doesn't really handle my 
problem.


I think because the properties file I want to load is not actually 
related to any core, it makes it easier to solve.  So if solr.xml is no 
longer rewritten then it seems like a global properties file could 
safely be specified there using property expansion.  Or maybe there is 
some way to write some code that could get executed before schema and 
solrconfig are parsed, although I'm not sure how that would work given 
how you need solrconfig to load the libraries and define plugins.


Ryan

On 08/20/2014 01:07 PM, Erick Erickson wrote:

Hmmm, I was going to make a code change to do this, but Chris
Hostetter saved me from the madness that ensues. Here's his comment on
the JIRA that I did open (but then closed), does this handle your
problem?

I don't think we want to make the name of core.properties be variable
... that way leads to madness and confusion.

the request on the user list was about being able to dynamically load
a property file with diff values between dev & production like you
could do in the old style solr.xml – that doesn't mean core.properties
needs to have a configurable name, it just means there needs to be a
configurable way to load properties.

we already have a properties option which can be specified in
core.properties to point to an additional external file that should
also be loaded ... if variable substitution was in play when parsing
core.properties then you could have something like
properties=custom.${env}.properties in core.properties ... but
introducing variable substitution into the core.properties (which solr
both reads & writes based on CoreAdmin calls) brings back the host of
complexities involved when we had persistence of solr.xml as a
feature, with the questions about persisting the original values with
variables in them, vs the values after evaluating variables.

Best,
Erick

On Wed, Aug 20, 2014 at 11:36 AM, Ryan Josal ry...@pointinside.com wrote:

Hi all, I have a question about dynamically loading a core properties file
with the new core discovery method of defining cores.  The concept is that I
can have a dev.properties file and a prod.properties file, and specify which
one to load with -Dsolr.env=dev.  This way I can have one file which
specifies a bunch of runtime properties like external servers a plugin might
use, etc.

Previously I was able to do this in solr.xml because it can do system
property substitution when defining which properties file to use for a core.

Now I'm not sure how to do this with core discovery, since the core is
discovered based on this file, and now the file needs to contain things that
are specific to that core, like name, which previously were defined in the
xml definition.

Is there a way I can plugin some code that gets run before any schema or
solrconfigs are parsed?  That way I could write a property loader that adds
properties from ${solr.env}.properties to the JVM system properties.

Thanks!
Ryan




RE: Correct way for getting SolrCore?

2013-02-06 Thread Ryan Josal
This is perfect, thanks!  I'm surprised it eluded me for so long.
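
For the archives, the resulting shape is roughly this (a sketch; class and
helper names are mine).  inform() is called once the core has finished
loading, which is where the per-core initialization can live:

import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
import org.apache.solr.util.RefCounted;
import org.apache.solr.util.plugin.SolrCoreAware;

public class CoreAwareUpdateProcessorFactory extends UpdateRequestProcessorFactory
    implements SolrCoreAware {

  private SolrCore core;

  // Called after the core is loaded, avoiding the recursive-load problem of
  // reaching for cores inside init().
  @Override
  public void inform(SolrCore core) {
    this.core = core;
  }

  // Example of reaching the index later (e.g. from the processor): borrow the
  // searcher, use its IndexReader, and release it.
  int numDocs() {
    RefCounted<SolrIndexSearcher> ref = core.getSearcher();
    try {
      return ref.get().getIndexReader().numDocs();
    } finally {
      ref.decref();
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
      UpdateRequestProcessor next) {
    // placeholder: a real factory returns a processor that uses the per-core data
    return next;
  }
}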

From: Mark Miller [markrmil...@gmail.com]
Sent: Tuesday, February 05, 2013 4:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Correct way for getting SolrCore?

The SolrCoreAware interface?

- Mark

On Feb 5, 2013, at 5:42 PM, Ryan Josal rjo...@rim.com wrote:

 By way of the deprecated SolrCore.getSolrCore method,

 SolrCore.getSolrCore().getCoreDescriptor().getCoreContainer().getCores()

 Solr starts up in an infinite recursive loop of loading cores.  I understand 
 now that the UpdateProcessorFactory is initialized as part of the core 
 initialization, so I expect there is no way to read the index of a core if 
 the core has not been initialized yet.  I still feel a bit uneasy about 
 initialization on the first update request, so is there some other place I 
 can plugin initialization code that runs after the core is loaded?  I suppose 
 I'd be using SolrCore.getSearcher().get().getIndexReader() to get the 
 IndexReader, but if that happens after a good point of plugging in this 
 initialization, then I guess SolrCore.getIndexReaderFactory() is the way to 
 go.

 Thanks,
 Ryan
 
 From: Ryan Josal [rjo...@rim.com]
 Sent: Tuesday, February 05, 2013 1:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Correct way for getting SolrCore?

 Is there any way I can get the cores and do my initialization in the 
 @Override public void init(final NamedList args) method?  I could wait for 
 the first request, but I imagine I'd have to deal with indexing requests 
 piling up while I iterate over every document in every index.

 Ryan
 
 From: Mark Miller [markrmil...@gmail.com]
 Sent: Tuesday, February 05, 2013 1:15 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Correct way for getting SolrCore?

 The request should give you access to the core - the core to the core 
 descriptor, the descriptor to the core container, which knows about all the 
 cores.

 - Mark

 On Feb 5, 2013, at 4:09 PM, Ryan Josal rjo...@rim.com wrote:

 Hey guys,

 I am writing an UpdateRequestProcessorFactory plugin which needs to have 
 some initialization code in the init method.  I need to build some 
 information about each SolrCore in memory so that when an update comes in 
 for a particular SolrCore, I can use the data for the appropriate core.  
 Ultimately, I need a lucene IndexReader for each core.  I figure I'd get 
 this through a SolrCore, CoreContainer, or CoreDescriptor.  I've looked 
 around for awhile and I always end up going in circles.  So how can I 
 iterate over cores that have been loaded?

 Ryan






Correct way for getting SolrCore?

2013-02-05 Thread Ryan Josal
Hey guys,

  I am writing an UpdateRequestProcessorFactory plugin which needs to have some 
initialization code in the init method.  I need to build some information about 
each SolrCore in memory so that when an update comes in for a particular 
SolrCore, I can use the data for the appropriate core.  Ultimately, I need a 
lucene IndexReader for each core.  I figure I'd get this through a SolrCore, 
CoreContainer, or CoreDescriptor.  I've looked around for awhile and I always 
end up going in circles.  So how can I iterate over cores that have been loaded?

Ryan


RE: Correct way for getting SolrCore?

2013-02-05 Thread Ryan Josal
Is there any way I can get the cores and do my initialization in the @Override 
public void init(final NamedList args) method?  I could wait for the first 
request, but I imagine I'd have to deal with indexing requests piling up while 
I iterate over every document in every index.

Ryan

From: Mark Miller [markrmil...@gmail.com]
Sent: Tuesday, February 05, 2013 1:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Correct way for getting SolrCore?

The request should give you access to the core - the core to the core 
descriptor, the descriptor to the core container, which knows about all the 
cores.

- Mark

On Feb 5, 2013, at 4:09 PM, Ryan Josal rjo...@rim.com wrote:

 Hey guys,

  I am writing an UpdateRequestProcessorFactory plugin which needs to have 
 some initialization code in the init method.  I need to build some 
 information about each SolrCore in memory so that when an update comes in for 
 a particular SolrCore, I can use the data for the appropriate core.  
 Ultimately, I need a lucene IndexReader for each core.  I figure I'd get this 
 through a SolrCore, CoreContainer, or CoreDescriptor.  I've looked around for 
 awhile and I always end up going in circles.  So how can I iterate over cores 
 that have been loaded?

 Ryan




RE: Correct way for getting SolrCore?

2013-02-05 Thread Ryan Josal
By way of the deprecated SolrCore.getSolrCore method,

SolrCore.getSolrCore().getCoreDescriptor().getCoreContainer().getCores()

Solr starts up in an infinite recursive loop of loading cores.  I understand 
now that the UpdateProcessorFactory is initialized as part of the core 
initialization, so I expect there is no way to read the index of a core if the 
core has not been initialized yet.  I still feel a bit uneasy about 
initialization on the first update request, so is there some other place I can 
plugin initialization code that runs after the core is loaded?  I suppose I'd 
be using SolrCore.getSearcher().get().getIndexReader() to get the IndexReader, 
but if that happens after a good point of plugging in this initialization, then 
I guess SolrCore.getIndexReaderFactory() is the way to go.

Thanks,
Ryan

From: Ryan Josal [rjo...@rim.com]
Sent: Tuesday, February 05, 2013 1:27 PM
To: solr-user@lucene.apache.org
Subject: RE: Correct way for getting SolrCore?

Is there any way I can get the cores and do my initialization in the @Override 
public void init(final NamedList args) method?  I could wait for the first 
request, but I imagine I'd have to deal with indexing requests piling up while 
I iterate over every document in every index.

Ryan

From: Mark Miller [markrmil...@gmail.com]
Sent: Tuesday, February 05, 2013 1:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Correct way for getting SolrCore?

The request should give you access to the core - the core to the core 
descriptor, the descriptor to the core container, which knows about all the 
cores.

- Mark

On Feb 5, 2013, at 4:09 PM, Ryan Josal rjo...@rim.com wrote:

 Hey guys,

  I am writing an UpdateRequestProcessorFactory plugin which needs to have 
 some initialization code in the init method.  I need to build some 
 information about each SolrCore in memory so that when an update comes in for 
 a particular SolrCore, I can use the data for the appropriate core.  
 Ultimately, I need a lucene IndexReader for each core.  I figure I'd get this 
 through a SolrCore, CoreContainer, or CoreDescriptor.  I've looked around for 
  a while and I always end up going in circles.  So how can I iterate over cores 
 that have been loaded?

 Ryan


RE: SolrJ DirectXmlRequest

2013-01-23 Thread Ryan Josal
Thanks Hoss,

  The issue mentioned describes a similar behavior to what I observed, but not 
quite.  Commons-fileupload creates java.io.File objects for the temp files, and 
when those Files are garbage collected, the temp file is deleted.  I've 
verified this by letting the temp files build up and then forcing a full 
collection which clears all of them.  So I think the reason a percentage of 
temp files built up in my system was that under heavy load, some of the 
java.io.Files made it into old gen in the heap.  I switched to G1, and the 
problem went away.
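
(For anyone who hits the same behaviour: on a Tomcat 7 setup of this vintage, switching collectors is typically just a startup flag, something along the lines of the line below in setenv.sh; exact flags vary by Java version.)

CATALINA_OPTS="$CATALINA_OPTS -XX:+UseG1GC"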

Regarding the how the XML files are being sent, I have verified that each XML 
file is sent as a single request, by aligning the access log of my Solr master 
server with the processing log of my SolrJ server.  I didn't test the requests 
to see if the MIME type is multipart, but I suppose it is possible if some 
other form data or instruction needed to be passed with it.  Either way, I 
suppose it would go through fileupload anyway, because somebody's got to make a 
temp file for large files, right?

Ryan

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Wednesday, January 16, 2013 6:06 PM
To: solr-user@lucene.apache.org
Subject: RE: SolrJ DirectXmlRequest

: DirectXmlRequest is part of the SolrJ library, so I guess that means it
: is not commonly used.  My use case is that I'm applying an XSLT to the
: raw XML on the client side, instead of leaving that up to the Solr
: master (although even if I applied the XSLT on the Solr server, I'd

I think Otis's point was that most people don't have Solr XML files lying
arround that they send to Solr, nor do they build up XML strings in Java
in the Solr input format (with XSLT or otherwise) ... most people using
SolrJ build up SolrInputDocument objects and pass those to their
SolrServer instance.
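
For reference, the usual SolrJ pattern described here looks roughly like this (the URL and field names are invented for the example; on Solr 3.6 the client class is HttpSolrServer, older 3.x releases used CommonsHttpSolrServer):

SolrServer solr = new HttpSolrServer("http://solr-master:8090/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
doc.addField("title", "An example document");
solr.add(doc);    // documents go over as a single stream, no multipart upload
solr.commit();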

: I've done some research and I'm fairly confident that apache
: commons-fileupload library is responsible for the temp files.  There's

I believe you are correct ... searching for solr fileupload temp files
led me to this issue, which seems to have fallen by the wayside...

https://issues.apache.org/jira/browse/SOLR-1953

...if you could try that patch out and/or post your comments, it would be
helpful.

Something that seems really odd to me however is how/why your basic
updates are even causing multipart/file-upload functionality to be used
... a quick skim of the client code suggests that that should only happen
if your try to send multiple ContentStreams in a single request: I can
understand why that wouldn't typically happen for most users building up
multiple SolrInputDocuments (they would get added to a single stream); and
i can understand why that would typically happen for users sending
multiple binary files to something like ExtractingRequestHandler -- but if
you are using DirectXmlRequest in the way you described each xml file
should be sent as a single stream in a single request and the XML should
be sent in the raw POST body -- the commons-fileupload code shouldn't even
come into play.  (either that, or i'm missing something, or you're using
an older version of solr that used fileupload even if there was only a
single content stream)


-Hoss



RE: SolrJ DirectXmlRequest

2013-01-09 Thread Ryan Josal
I also don't know what's creating them.  Maybe Solr, but also maybe Tomcat, 
maybe apache commons.  I could change java.io.tmpdir to one with more space, 
but the problem is that many of the temp files end up permanent, so eventually 
it would still run out of space.  I also considered setting the tmpdir to 
/dev/null, but that would defeat the purpose of whatever is writing those log 
files in the first place.  I could periodically clean up the tmpdir myself, but 
that feels the hackiest.
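
(For anyone searching later: pointing the JVM at a roomier temp directory is a single startup flag, e.g.:)

-Djava.io.tmpdir=/path/with/more/space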

Is it fairly common to send XML to Solr this way from a remote host?  If it is, 
then that would lead me to believe Solr and any of its libraries aren't 
causing it, and I should inspect Tomcat.  I'm using Tomcat 7.

Ryan

From: Otis Gospodnetic [otis.gospodne...@gmail.com]
Sent: Tuesday, January 08, 2013 7:29 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrJ DirectXmlRequest

Hi Ryan,

I'm not sure what is creating those upload files: something in Solr? Or
Tomcat?

Why not specify a different temp dir via a system property command line
parameter?

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jan 8, 2013 12:17 PM, Ryan Josal rjo...@rim.com wrote:

 I have encountered an issue where using DirectXmlRequest to index data on
 a remote host results in eventually running out of temp disk space in the
 java.io.tmpdir directory.  This occurs when I process a sufficiently large
 batch of files.  About 30% of the temporary files end up permanent.  The
 filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp.  Has
 anyone else had this happen before?  The relevant code is:

 DirectXmlRequest up = new DirectXmlRequest("/update", xml);
 up.process(solr);

 where `xml` is a String containing Solr formatted XML, and `solr` is the
 SolrServer.  When disk space is eventually exhausted, this is the error
 message that is repeatedly seen on the master host:

 2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR
 org.apache.solr.servlet.SolrDispatchFilter  [] -
 org.apache.commons.fileupload.FileUploadBase$IOFileUploadException:
 Processing of multipart/form-data request failed. No space left on device
 at
 org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
 at
 org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
 at
 org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
 at
 org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
 at
 org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
 at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 ... truncated stack trace

 I am running Solr 3.6 on an Ubuntu 12.04 server.  I am considering working
 around this by pulling out as much as I can from XMLLoader into my client,
 and processing the XML myself into SolrInputDocuments for indexing, but
 this is certainly not ideal.

 Ryan


RE: SolrJ DirectXmlRequest

2013-01-09 Thread Ryan Josal
Thanks Otis,

DirectXmlRequest is part of the SolrJ library, so I guess that means it is not 
commonly used.  My use case is that I'm applying an XSLT to the raw XML on the 
client side, instead of leaving that up to the Solr master (although even if I 
applied the XSLT on the Solr server, I'd still use DirectXmlRequest to get the 
raw XML there).  This does lead me to the idea that parsing the XML without the 
XSLT is probably better than copying some of XMLLoader to parse Solr XML as a 
workaround, and might be a good idea to do anyway.
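
A rough sketch of that client-side flow, with the stylesheet, input file, and solr variable assumed for the example:

// apply the XSLT on the client...
Transformer t = TransformerFactory.newInstance()
    .newTransformer(new StreamSource(new File("solr-transform.xsl")));
StringWriter out = new StringWriter();
t.transform(new StreamSource(new File("input.xml")), new StreamResult(out));

// ...then post the resulting Solr XML with DirectXmlRequest
DirectXmlRequest up = new DirectXmlRequest("/update", out.toString());
up.process(solr);  // solr is the SolrServer pointing at the master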

I've done some research and I'm fairly confident that apache commons-fileupload 
library is responsible for the temp files.  There's an explanation for how 
files are cleaned up at http://commons.apache.org/fileupload/using.html in the 
Resource cleanup section.  I have observed that forcing a garbage collection 
over JMX results in all temporary files being purged.  This implies that many 
of the java.io.File objects are moving to the old gen of the heap, where they survive 
long enough (only a few minutes in my case) to use up all tmp disk space.

I think this can probably be solved by GC tuning, or, failing that, introducing 
a (less desirable) System.gc() somewhere in the updateRequestProcessorChain.

Thanks for your help, and hopefully this will be useful if someone else runs 
into a similar problem.

Ryan

From: Otis Gospodnetic [otis.gospodne...@gmail.com]
Sent: Wednesday, January 09, 2013 11:53 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrJ DirectXmlRequest

Hi Ryan,

One typically uses a Solr client library to talk to Solr instead of sending
raw XML.  For example, if your application is written in Java then you
would use SolrJ.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, Jan 9, 2013 at 12:03 PM, Ryan Josal rjo...@rim.com wrote:

 I also don't know what's creating them.  Maybe Solr, but also maybe
 Tomcat, maybe apache commons.  I could change java.io.tmpdir to one with
 more space, but the problem is that many of the temp files end up
 permanent, so eventually it would still run out of space.  I also
 considered setting the tmpdir to /dev/null, but that would defeat the
 purpose of whatever is writing those log files in the first place.  I could
 periodically clean up the tmpdir myself, but that feels the hackiest.

 Is it fairly common to send XML to Solr this way from a remote host?  If
  it is, then that would lead me to believe Solr and any of its libraries
 aren't causing it, and I should inspect Tomcat.  I'm using Tomcat 7.

 Ryan
 
 From: Otis Gospodnetic [otis.gospodne...@gmail.com]
 Sent: Tuesday, January 08, 2013 7:29 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrJ DirectXmlRequest

 Hi Ryan,

 I'm not sure what is creating those upload files: something in Solr? Or
 Tomcat?

 Why not specify a different temp dir via a system property command line
 parameter?

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Jan 8, 2013 12:17 PM, Ryan Josal rjo...@rim.com wrote:

  I have encountered an issue where using DirectXmlRequest to index data on
  a remote host results in eventually running out of temp disk space in
 the
  java.io.tmpdir directory.  This occurs when I process a sufficiently
 large
  batch of files.  About 30% of the temporary files end up permanent.  The
  filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp.
  Has
  anyone else had this happen before?  The relevant code is:
 
  DirectXmlRequest up = new DirectXmlRequest("/update", xml);
  up.process(solr);
 
  where `xml` is a String containing Solr formatted XML, and `solr` is the
  SolrServer.  When disk space is eventually exhausted, this is the error
  message that is repeatedly seen on the master host:
 
  2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR
  org.apache.solr.servlet.SolrDispatchFilter  [] -
  org.apache.commons.fileupload.FileUploadBase$IOFileUploadException:
  Processing of multipart/form-data request failed. No space left on device
  at
 
 org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
  at
 
 org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
  at
 
 org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
  at
 
 org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
  at
 
 org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
  at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
  at
 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  ... truncated stack trace
 
  I am running Solr 3.6 on an Ubuntu 12.04 server.  I am considering
 working
  around this by pulling out

SolrJ DirectXmlRequest

2013-01-08 Thread Ryan Josal
I have encountered an issue where using DirectXmlRequest to index data on a 
remote host results in eventually running out of temp disk space in the 
java.io.tmpdir directory.  This occurs when I process a sufficiently large 
batch of files.  About 30% of the temporary files end up permanent.  The 
filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp.  Has 
anyone else had this happen before?  The relevant code is:

DirectXmlRequest up = new DirectXmlRequest("/update", xml);
up.process(solr);

where `xml` is a String containing Solr formatted XML, and `solr` is the 
SolrServer.  When disk space is eventually exhausted, this is the error message 
that is repeatedly seen on the master host:

2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR 
org.apache.solr.servlet.SolrDispatchFilter  [] - 
org.apache.commons.fileupload.FileUploadBase$IOFileUploadException: Processing 
of multipart/form-data request failed. No space left on device
at 
org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
at 
org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
at 
org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
at 
org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
at 
org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
... truncated stack trace

I am running Solr 3.6 on an Ubuntu 12.04 server.  I am considering working 
around this by pulling out as much as I can from XMLLoader into my client, and 
processing the XML myself into SolrInputDocuments for indexing, but this is 
certainly not ideal.

Ryan
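
For completeness, a bare-bones version of that workaround might look like the sketch below, hand-parsing the usual add/doc/field XML into SolrInputDocuments instead of reusing XMLLoader (it ignores boosts, delete commands, and everything else XMLLoader handles; xml and solr are the same variables as above):

DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document dom = db.parse(new InputSource(new StringReader(xml)));

List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
NodeList docNodes = dom.getElementsByTagName("doc");
for (int i = 0; i < docNodes.getLength(); i++) {
  SolrInputDocument sdoc = new SolrInputDocument();
  NodeList fields = ((Element) docNodes.item(i)).getElementsByTagName("field");
  for (int j = 0; j < fields.getLength(); j++) {
    Element f = (Element) fields.item(j);
    sdoc.addField(f.getAttribute("name"), f.getTextContent());
  }
  docs.add(sdoc);
}
solr.add(docs);    // no raw XML body, so nothing for commons-fileupload to spool to disk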