Re: Tokenizer or Filter ?

2015-01-13 Thread Jack Krupansky
Actually, you may be able to get by using PatternReplaceCharFilterFactory -
copy the source value to two fields: one treats <d2>.*</d2> as the
delimiter pattern to delete, and the other uses <d1>.*</d1> as the
delimiter pattern to delete, so the first field has only <d1> content and
the second has only <d2> content. You can use a second pattern char filter
to remove the <[/]d[12]> markers as well, probably changing them to a space
in both cases.
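
As a rough sketch (untested, and assuming the markers really are literal
<d1>/<d2> tags), the field type that keeps only the <d1> content could look
like this; a mirror-image type would keep only the <d2> content:

<fieldType name="text_d1_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- delete everything between <d2> and </d2>, including the markers -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&lt;d2&gt;.*?&lt;/d2&gt;" replacement=" "/>
    <!-- remove the remaining <d1> / </d1> markers, keeping their content -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&lt;/?d1&gt;" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>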

See:
http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html

-- Jack Krupansky

On Tue, Jan 13, 2015 at 11:40 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Would it be sufficient for your use case to simply extract all the <d1>
 sections into one field and all the <d2> sections into another field? If so,
 the update processor script would be very simple: match all <d1>.*</d1>
 sections, copy them to a separate field value, and do the same for <d2>.

 If you want examples of script update processors, see my Solr e-book:

 http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

 -- Jack Krupansky

 On Tue, Jan 13, 2015 at 9:21 AM, tomas.kalas kala...@email.cz wrote:

 Thanks Jack for your advice. Can you please explain a little more how it
 works? From the Apache wiki it's not too clear to me. Can I write some
 JavaScript code when I want to filter some data? In this case I have
 <d1>bla bla bla</d1> <d2>bla bla bla</d2> <d1>bla bla bla</d1> and I want
 to filter out <d2>bla bla bla</d2>, but in another case I want to filter
 out all <d1> ... </d1>; then I suppose I would use it on the indexed data
 and filter from there? Thanks



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179173.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: Solr large boolean filter

2015-01-13 Thread rashmy1
Hello,
We have a similar requirement where a large list of IDs needs to be sent to
Solr in a filter query.
Could someone please help us understand whether this feature is now supported
in the newer versions of Solr?

Thanks
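
For reference, one option in recent Solr releases (the terms query parser,
added around 4.10 if I recall correctly) is to pass the ID list as a single
comma-separated filter instead of a huge Boolean query; the field name and
values below are illustrative, and the request should be POSTed if the list
is long:

fq={!terms f=id}12345,12346,12347,12400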



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-tp4070747p4179276.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
Shawn,

Thanks for the suggestion, but experimentally, in my case the same query with 
facet.method=enum returns in almost the same amount of time.

Regards
David 

 On Tuesday, January 13, 2015 12:02 PM, Shawn Heisey apa...@elyograg.org 
wrote:
   

 On 1/13/2015 10:35 AM, David Smith wrote:
 I have a query against a single 50M doc index (175GB) using Solr 4.10.2, that 
 exhibits the following response times (via the debugQuery option in Solr 
 Admin):
 "process": {
   "time": 24709,
   "query": { "time": 54 }, "facet": { "time": 24574 },


 The query time of 54ms is great and exactly as expected -- this example was a 
 single-term search that returned 3 hits.
 I am trying to get the facet time (24.5 seconds) to be sub-second, and am 
 having no luck.  The facet part of the query is as follows:

 "params": { "facet.range": "eventDate",
  "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
  "f.eventDate.facet.range.gap": "+1DAY",
  "start": "0",
  "rows": "10",
  "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
  "f.eventDate.facet.mincount": "1",
  "facet": "true",
  "debugQuery": "true",
  "_": "1421169383802"
  }

 And, the relevant schema definition is as follows:

    <field name="eventDate" type="tdate" indexed="true" stored="true"
           multiValued="false" docValues="true"/>

    <!-- A Trie based date field for faster date range queries and date
         faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
               positionIncrementGap="0"/>


 During the 25-second query, the Solr JVM pegs one CPU, with little or no I/O 
 activity detected on the drive that holds the 175GB index.  I have 48GB of 
 RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.

 I do NOT have any fieldValue caches configured as yet, because my (perhaps 
 too simplistic?) reading of the documentation was that DocValues eliminates 
 the need for a field-level cache on this facet field.

24GB of RAM to cache 175GB is probably not enough in the general case,
but if you're seeing very little disk I/O activity for this query, then
we'll leave that alone and you can worry about it later.

What I would try immediately is setting the facet.method parameter to
enum and seeing what that does to the facet time.  I've had good luck
generally with that, even in situations where the docs indicated that
the default (fc) was supposed to work better.  I have never explored the
relationship between facet.method and docValues, though.

I'm out of ideas after this.  I don't have enough experience with
faceting to help much.

Thanks,
Shawn



   

Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
I have a query against a single 50M doc index (175GB) using Solr 4.10.2, that 
exhibits the following response times (via the debugQuery option in Solr Admin):
"process": {
 "time": 24709,
 "query": { "time": 54 }, "facet": { "time": 24574 },


The query time of 54ms is great and exactly as expected -- this example was a 
single-term search that returned 3 hits.
I am trying to get the facet time (24.5 seconds) to be sub-second, and am 
having no luck.  The facet part of the query is as follows:

"params": { "facet.range": "eventDate",
 "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
 "f.eventDate.facet.range.gap": "+1DAY",
 "start": "0",
 "rows": "10",
 "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
 "f.eventDate.facet.mincount": "1",
 "facet": "true",
 "debugQuery": "true",
 "_": "1421169383802"
 }

And, the relevant schema definition is as follows:

   <field name="eventDate" type="tdate" indexed="true" stored="true"
          multiValued="false" docValues="true"/>

   <!-- A Trie based date field for faster date range queries and date
        faceting. -->
   <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
              positionIncrementGap="0"/>


During the 25-second query, the Solr JVM pegs one CPU, with little or no I/O 
activity detected on the drive that holds the 175GB index.  I have 48GB of RAM, 
1/2 of that dedicated to the OS and the other to the Solr JVM.

I do NOT have any fieldValue caches configured as yet, because my (perhaps too 
simplistic?) reading of the documentation was that DocValues eliminates the 
need for a field-level cache on this facet field.

Any suggestions welcome.

Regards,
David



Improved suggester question

2015-01-13 Thread Dan Davis
The suggester is not working for me with Solr 4.10.2

Can anyone shed light over why I might be getting the exception below when
I build the dictionary?

<response>
<lst name="responseHeader">
<int name="status">500</int>
<int name="QTime">26</int>
</lst>
<lst name="error">
<str name="msg">len must be &lt;= 32767; got 35680</str>
<str name="trace">
java.lang.IllegalArgumentException: len must be &lt;= 32767; got 35680 at
org.apache.lucene.util.OfflineSorter$ByteSequencesWriter.write(OfflineSorter.java:479)
at
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(AnalyzingSuggester.java:493)
at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190) at
org.apache.solr.spelling.suggest.SolrSuggester.build(SolrSuggester.java:160)
at
org.apache.solr.handler.component.SuggestComponent.prepare(SuggestComponent.java:165)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:200) at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
</str>
<int name="code">500</int>
</lst>
</response>

Thank you.

I've configured my suggester as follows:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">text</str>
    <str name="weightField">medsite_id</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnCommit">true</str>
    <str name="threshold">0.1</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">on</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>


Re: Logging in Solr's DataImportHandler

2015-01-13 Thread Dan Davis
Mikhail,

Thanks - it works now. The script transformer was really not needed, a
template transformer is clearer, and the log transformer is now working.
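
For anyone finding this thread later, a minimal sketch of the kind of entity
definition involved (the entity name, query and template value are
illustrative, not the actual config from this project):

<entity name="item" query="SELECT id, title FROM item"
        transformer="TemplateTransformer,LogTransformer"
        logTemplate="DIH processed id=${item.id}" logLevel="info">
  <field column="source" template="db-import"/>
</entity>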

On Mon, Dec 8, 2014 at 1:56 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Hello Dan,

 Usually it works well. Can you describe how you run it in particular, e.g.
 what you download exactly and what the command line is?

 On Fri, Dec 5, 2014 at 11:37 PM, Dan Davis dansm...@gmail.com wrote:

 I have a script transformer and a log transformer, and I'm not seeing the
 log messages, at least not where I expect.
 Is there any way I can simply log a custom message from within my script?
 Can the script easily interact with its container's logger?




 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: Highlighting whole phrase

2015-01-13 Thread Ahmet Arslan
Hi,

hl.usePhraseHighlighter is valid for the standard highlighter. Maybe you are
using one of the other highlighters?

Or maybe you have omitTermFreqAndPositions="true" in the definition of the
text_general field type?

Ahmet


On Tuesday, January 13, 2015 5:52 PM, meena.sri...@mathworks.com 
meena.sri...@mathworks.com wrote:
Highlighting does not highlight the whole phrase; instead each word gets
highlighted.
I tried all the suggestions that were given, with no luck.
These are the settings I tried for phrase highlighting:
hl.usePhraseHighlighter=true
hl.q=query


http://localhost.mathworks.com:8983/solr/db/select?q=syndrome%3A%22Override+ignored+for+property%22&rows=1&fl=syndrome_id&wt=json&indent=true&hl=true&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E&hl.usePhraseHighlighter=true&hl.q=%22Override+ignored+for+property%22&hl.fragsize=1000



This is from my schema.xml
<field name="syndrome" type="text_general" indexed="true" stored="true"/>


Should I add parameters in the indexing stage itself to make this work?

Thanks for your time.

Meena




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highting-whole-pharse-tp4179219.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Michael and Hoss,

assuming I've written the subclass of the postings format, I need to tell
Solr to use it.

Do I just do something like:

<fieldType name="ocr" class="solr.TextField" postingsFormat="MySubclass"/>

Is there a way to set this for all fieldtypes or would that require writing
a custom CodecFactory?

Tom


On Mon, Jan 12, 2015 at 4:46 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : It looks like this is a good starting point:
 :
 : http://wiki.apache.org/solr/SolrConfigXml#codecFactory

  The default SchemaCodecFactory already supports defining a different posting
  format per fieldType - but there isn't much in Solr to let you tweak
  individual options on specific posting formats via configuration.

  So what you'd need to do is write a small subclass of
  Lucene41PostingsFormat that calls super(yourMin, yourMax) in its
  constructor.






Suggester questions

2015-01-13 Thread Dan Davis
I am having some trouble getting the suggester to work.   The spell
requestHandler is working, but I didn't like the results I was getting from
the word breaking dictionary and turned them off.
So some basic questions:

   - How can I check on the status of a dictionary?
   - How can I see what is in that dictionary?
   - How do I actually manually rebuild the dictionary - all attempts to
   set spellcheck.build=on or suggest.build=on have led to nearly instant
   results (0 suggestions for the latter), indicating something is wrong.


Thanks,

Daniel Davis


Re: Slow faceting performance on a docValues field

2015-01-13 Thread Shawn Heisey
On 1/13/2015 10:35 AM, David Smith wrote:
 I have a query against a single 50M doc index (175GB) using Solr 4.10.2, that 
 exhibits the following response times (via the debugQuery option in Solr 
 Admin):
 "process": {
  "time": 24709,
  "query": { "time": 54 }, "facet": { "time": 24574 },


 The query time of 54ms is great and exactly as expected -- this example was a 
 single-term search that returned 3 hits.
 I am trying to get the facet time (24.5 seconds) to be sub-second, and am 
 having no luck.  The facet part of the query is as follows:

 "params": { "facet.range": "eventDate",
  "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
  "f.eventDate.facet.range.gap": "+1DAY",
  "start": "0",
  "rows": "10",
  "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
  "f.eventDate.facet.mincount": "1",
  "facet": "true",
  "debugQuery": "true",
  "_": "1421169383802"
  }

 And, the relevant schema definition is as follows:

 <field name="eventDate" type="tdate" indexed="true" stored="true"
        multiValued="false" docValues="true"/>

 <!-- A Trie based date field for faster date range queries and date
      faceting. -->
 <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
            positionIncrementGap="0"/>


 During the 25-second query, the Solr JVM pegs one CPU, with little or no I/O 
 activity detected on the drive that holds the 175GB index.  I have 48GB of 
 RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.

 I do NOT have any fieldValue caches configured as yet, because my (perhaps 
 too simplistic?) reading of the documentation was that DocValues eliminates 
 the need for a field-level cache on this facet field.

24GB of RAM to cache 175GB is probably not enough in the general case,
but if you're seeing very little disk I/O activity for this query, then
we'll leave that alone and you can worry about it later.

What I would try immediately is setting the facet.method parameter to
enum and seeing what that does to the facet time.  I've had good luck
generally with that, even in situations where the docs indicated that
the default (fc) was supposed to work better.  I have never explored the
relationship between facet.method and docValues, though.

I'm out of ideas after this.  I don't have enough experience with
faceting to help much.

Thanks,
Shawn



Re: Slow faceting performance on a docValues field

2015-01-13 Thread Tomás Fernández Löbbe
Range faceting won't use the DocValues even if they are set; it
translates each gap to a filter. This means that it will end up using the
filterCache, which should make follow-up queries faster if you repeat the
same gaps (and don't commit).
You may also want to try interval faceting, which will use DocValues instead
of filters. The API is different: you'll have to provide the intervals
yourself.
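
For example, a sketch of what the interval-faceting parameters look like in
4.10 (only two one-day intervals on the eventDate field are shown here; the
client has to generate the full set):

facet=true
&facet.interval=eventDate
&f.eventDate.facet.interval.set=[2015-01-01T00:00:00Z,2015-01-02T00:00:00Z)
&f.eventDate.facet.interval.set=[2015-01-02T00:00:00Z,2015-01-03T00:00:00Z)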

Tomás

On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 1/13/2015 10:35 AM, David Smith wrote:
  I have a query against a single 50M doc index (175GB) using Solr 4.10.2,
 that exhibits the following response times (via the debugQuery option in
 Solr Admin):
  "process": {
   "time": 24709,
   "query": { "time": 54 }, "facet": { "time": 24574 },
 
 
  The query time of 54ms is great and exactly as expected -- this example
 was a single-term search that returned 3 hits.
  I am trying to get the facet time (24.5 seconds) to be sub-second, and
 am having no luck.  The facet part of the query is as follows:
 
  "params": { "facet.range": "eventDate",
   "f.eventDate.facet.range.end": "2015-05-13T16:37:18.000Z",
   "f.eventDate.facet.range.gap": "+1DAY",
   "start": "0",
   "rows": "10",
   "f.eventDate.facet.range.start": "2005-03-13T16:37:18.000Z",
   "f.eventDate.facet.mincount": "1",
   "facet": "true",
   "debugQuery": "true",
   "_": "1421169383802"
   }
 
  And, the relevant schema definition is as follows:
 
  <field name="eventDate" type="tdate" indexed="true" stored="true"
         multiValued="false" docValues="true"/>

  <!-- A Trie based date field for faster date range queries and date
       faceting. -->
  <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
             positionIncrementGap="0"/>
 
 
  During the 25-second query, the Solr JVM pegs one CPU, with little or no
 I/O activity detected on the drive that holds the 175GB index.  I have 48GB
 of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
 
  I do NOT have any fieldValue caches configured as yet, because my
 (perhaps too simplistic?) reading of the documentation was that DocValues
 eliminates the need for a field-level cache on this facet field.

 24GB of RAM to cache 175GB is probably not enough in the general case,
 but if you're seeing very little disk I/O activity for this query, then
 we'll leave that alone and you can worry about it later.

 What I would try immediately is setting the facet.method parameter to
 enum and seeing what that does to the facet time.  I've had good luck
 generally with that, even in situations where the docs indicated that
 the default (fc) was supposed to work better.  I have never explored the
 relationship between facet.method and docValues, though.

 I'm out of ideas after this.  I don't have enough experience with
 faceting to help much.

 Thanks,
 Shawn




Re: Best way to implement Spotlight of certain results

2015-01-13 Thread Dan Davis
Maybe I can use grouping, but my understanding of the feature is not up to
figuring that out :)

I tried something like

http://localhost:8983/solr/collection/select?q=childhood+cancer&group=on&group.query=childhood+cancer
Because the group.limit=1, I get a single result, and no other results.
If I add group.field=title, then I get each result, in a group of 1
member...

Erick's re-ranking I do understand - I can re-rank the top-N to make sure
the spotlighted result is always first, avoiding the potential problem of
having to overweight the title field. In practice, I may not ever need
to use the re-ranking, but it's there if I need it. This is enough,
because it gives me talking points.
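
As a rough sketch of that re-ranking idea (parameter values are illustrative;
the rerank query parser has been available since Solr 4.9):

q=childhood+cancer
&rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=100}
&rqq=title:"childhood cancer" OR alttitle_t:"childhood cancer"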


On Fri, Jan 9, 2015 at 3:05 PM, Michał B. . m.bienkow...@gmail.com wrote:

 Maybe I'm misunderstanding you, but I think you could use grouping to
 achieve such an effect. If you prepare two group queries, one with an exact
 match and the other, let's say, the default, then you will be able to extract
 the matches from the grouping results, e.g. (using the default Solr example
 collection):


  http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.query=manu%3A%22Ap+Computer+Inc.%22&group.query=name:Apple%2060%20GB%20iPod%20with%20Video%20Playback%20Black&group.limit=10

 this query will return two groups: one with the exact match, the second with
 the rest of the standard results.

 Regars,
 Michal


 2015-01-09 20:44 GMT+01:00 Erick Erickson erickerick...@gmail.com:

  Hmm, I wonder if the RerankingQueryParser might help here?
  See: https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking
 
  Best,
  Erick
 
  On Fri, Jan 9, 2015 at 10:35 AM, Dan Davis dansm...@gmail.com wrote:
   I have a requirement to spotlight certain results if the query text exactly
   matches the title or "see" reference (indexed by me as alttitle_t).
   What that means is that these matching results are shown above the
   top-10/20 list with different CSS and fields. It's like "I'm Feeling Lucky"
   on Google :)
  
   I have considered three ways of implementing this:
  
  1. Assume that edismax qf/pf will boost these results to be first
 when
  there is an exact match on these important fields.   The downside
  then is
  that my relevancy is constrained and I must maintain my
 configuration
  with
  title and alttitle_t as top search fields (see XML snippet below).
  I may
  have to overweight them to achieve the always first criteria.
   Another
  less major downside is that I must always return the spotlight
 summary
  field (for display) and the image to display on each search.   These
  could
  be got from a database by the id, however, it is convenient to get
  them
  from Solr.
  2. Issue two searches for every user search, and use a second set of
  parameters (change the search type and fields to search only by
 exact
  matching a specific string field spottitle_s).   The search for the
  spotlight can then have its own configuration.   The downside here
 is
  that
  I am using Django and pysolr for the front-end, and pysolr is both
  synchronous and tied to the requestHandler named select.
   Convention.
  Of course, running in parallel is not a fix-all - running a search
  takes
  some time, even if run in parallel.
  3. Automate the population of elevate.xml so that all these 959
  queries
  are here.   This is probably best, but forces me to restart/reload
  when
  there are changes to this components.   The elevation can be done
  through a
  query.
  
   What I'd love to do is to configure the select requestHandler to run
  both
   searches and return me both sets of results.   Is there anyway to do
  that -
   apply the same q= parameter to two configured way to run a search?
   Something like sub queries?
  
   I suspect that approach 1 will get me through my demo and a brief
   evaluation period, but that either approach 2 or 3 will be the winner.
  
   Here's a snippet from my current qf/pf configuration:
  <str name="qf">
    title^100
    alttitle_t^100
    ...
    text
  </str>
  <str name="pf">
    title^1000
    alttitle_t^1000
    ...
    text^10
  </str>
  
   Thanks,
  
   Dan Davis
 



 --
 Michał Bieńkowski



Re: Occasionally getting error in solr suggester component.

2015-01-13 Thread Michael Sokolov
I think you are probably getting bitten by one of the issues addressed 
in LUCENE-5889


I would recommend against using buildOnCommit=true - with a large index 
this can be a performance-killer.  Instead, build the index yourself 
using the Solr spellchecker support (spellcheck.build=true)
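
For example, a one-off build request might look like the following, assuming
the /suggest handler and dictionary name from the configuration quoted below
(note it is suggest.build=true for the SuggestComponent, spellcheck.build=true
for a spellcheck handler):

http://localhost:8983/solr/collection1/suggest?suggest=true&suggest.dictionary=haSuggester&suggest.build=true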


-Mike

On 01/13/2015 10:41 AM, Dhanesh Radhakrishnan wrote:

Hi all,

I am experiencing a problem in Solr SuggestComponent
Occasionally solr suggester component throws an  error like

Solr failed:
{"responseHeader":{"status":500,"QTime":1},"error":{"msg":"suggester was
not built","trace":"java.lang.IllegalStateException: suggester was not
built\n\tat
org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.lookup(AnalyzingInfixSuggester.java:368)\n\tat
org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.lookup(AnalyzingInfixSuggester.java:342)\n\tat
org.apache.lucene.search.suggest.Lookup.lookup(Lookup.java:240)\n\tat
org.apache.solr.spelling.suggest.SolrSuggester.getSuggestions(SolrSuggester.java:199)\n\tat
org.apache.solr.handler.component.SuggestComponent.process(SuggestComponent.java:234)\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)\n\tat
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)\n\tat
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)\n\tat
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)\n\tat
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)\n\tat
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)\n\tat
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)\n\tat
org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:680)\n\tat
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)\n\tat
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)\n\tat
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1002)\n\tat
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)\n\tat
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
java.lang.Thread.run(Thread.java:745)\n","code":500}}

This is not happening frequently, but when indexing and the suggester
component are working together, this error will occur.




In solr config

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">haSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>  <!--
org.apache.solr.spelling.suggest.fst -->
    <str name="suggestAnalyzerFieldType">textSpell</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str> <!--
org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
    <str name="field">name</str>
    <str name="weightField">packageWeight</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Can anyone suggest where to look to figure out this error and why these
errors are occurring?



Thanks,
dhanesh s.r




--





Re: Slow faceting performance on a docValues field

2015-01-13 Thread Tomás Fernández Löbbe
Just a side question: in your first example you have dates set with a time,
but in the second (where you set intervals) the time is not set.
Is this something that could be resolved by having a field that only stores
the date (without time), and then using regular field faceting with
facet.sort=index? If that's possible in your use case, it may be faster.
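
A rough sketch of that idea, assuming a separate day-granularity field
(eventDay here is a hypothetical string field populated at index time with
the day portion of eventDate):

<field name="eventDay" type="string" indexed="true" stored="false" docValues="true"/>

facet=true&facet.field=eventDay&facet.sort=index&facet.mincount=1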

Tomás

On Tue, Jan 13, 2015 at 11:12 AM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 No, you are not misreading, right now there is no automatic way of
 generating the intervals on the server side similar to range faceting... I
 guess it won't work in your case. Maybe you should create a Jira to add
 this feature to interval faceting.

 Tomás

 On Tue, Jan 13, 2015 at 10:44 AM, David Smith 
 dsmiths...@yahoo.com.invalid wrote:

 Tomás,


 Thanks for the response -- the performance of my query makes perfect
 sense in light of your information.
 I looked at Interval faceting.  My required interval is 1 day.  I cannot
 change that requirement.  Unless I am mis-reading the doc, that means to
 facet a 10 year range, the query needs to specify over 3,600 intervals ??


 f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]etc,etc


 Each query would be 185MB in size if I structure it this way.

 I assume I must be mis-understanding how to use Interval faceting with
 dates.  Are there any concrete examples you know of?  A google search did
 not come up with much.

 Kind regards,
 Dave

  On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:


  Range Faceting won't use the DocValues even if they are there set, it
 translates each gap to a filter. This means that it will end up using the
 FilterCache, which should cause faster followup queries if you repeat the
 same gaps (and don't commit).
 You may also want to try interval faceting, it will use DocValues instead
 of filters. The API is different, you'll have to provide the intervals
 yourself.

 Tomás

 On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org
 wrote:

  On 1/13/2015 10:35 AM, David Smith wrote:
   I have a query against a single 50M doc index (175GB) using Solr
 4.10.2,
  that exhibits the following response times (via the debugQuery option in
  Solr Admin):
   process: {
time: 24709,
query: { time: 54 }, facet: { time: 24574 },
  
  
   The query time of 54ms is great and exactly as expected -- this
 example
  was a single-term search that returned 3 hits.
   I am trying to get the facet time (24.5 seconds) to be sub-second, and
  am having no luck.  The facet part of the query is as follows:
  
   params: { facet.range: eventDate,
f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
f.eventDate.facet.range.gap: +1DAY,
start: 0,
  
rows: 10,
  
f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
  
f.eventDate.facet.mincount: 1,
  
facet: true,
  
debugQuery: true,
_: 1421169383802
}
  
   And, the relevant schema definition is as follows:
  
   <field name="eventDate" type="tdate" indexed="true" stored="true"
          multiValued="false" docValues="true"/>

   <!-- A Trie based date field for faster date range queries and date
        faceting. -->
   <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
              positionIncrementGap="0"/>
  
  
   During the 25-second query, the Solr JVM pegs one CPU, with little or
 no
  I/O activity detected on the drive that holds the 175GB index.  I have
 48GB
  of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
  
   I do NOT have any fieldValue caches configured as yet, because my
  (perhaps too simplistic?) reading of the documentation was that
 DocValues
  eliminates the need for a field-level cache on this facet field.
 
  24GB of RAM to cache 175GB is probably not enough in the general case,
  but if you're seeing very little disk I/O activity for this query, then
  we'll leave that alone and you can worry about it later.
 
  What I would try immediately is setting the facet.method parameter to
  enum and seeing what that does to the facet time.  I've had good luck
  generally with that, even in situations where the docs indicated that
  the default (fc) was supposed to work better.  I have never explored the
  relationship between facet.method and docValues, though.
 
  I'm out of ideas after this.  I don't have enough experience with
  faceting to help much.
 
  Thanks,
  Shawn
 
 







Re: Unexplained leader initiated recovery after updates - SolrCmdDistributor no longer retries on RemoteSolrException

2015-01-13 Thread Lindsay Martin
We are experiencing unexpected recovery events when a leader is sending
updates to a replica. A "java.net.SocketException: Connection reset" is
encountered when updating the replica, which triggers the recovery.

In our previous Solr 4.6.1 installation, update errors triggered retry
logic in SolrCmdDistributor and the updates continued without
triggering a leader-initiated recovery.


In our current 4.10.2 installation, this retry logic no longer occurs.


It looks like the fix for https://issues.apache.org/jira/browse/SOLR-5509
removed this retry logic. See
https://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/java/org/apache/solr/update/SolrCmdDistributor.java?r1=1546672&r2=1546164&pathrev=1546672
. This change was introduced with Solr 4.7.

The retry logic appears to have been removed while investigating an
unstable test. I am wondering if the retry logic should be restored for
production use.

Should I open a ticket to restore the retry logic?

Thanks,

Lindsay 

On 2015-01-12, 5:36 PM, Lindsay Martin lmar...@abebooks.com wrote:

I have uncovered some additional details in the shard leader log:

2015-01-11 09:38:00.693 [qtp268575911-3617101] INFO
org.apache.solr.update.processor.LogUpdateProcessor - [listings]
webapp=/solr path=/update
params={distrib.from=http://solr05.search.abebooks.com:8983/solr/listings/&update.distrib=TOLEADER&wt=javabin&version=2}
{add=[14065572860 (1490024273004199936)]} 0 707
2015-01-11 09:38:00.913 [updateExecutor-1-thread-35734] ERROR
org.apache.solr.update.StreamingSolrServers - error
java.net.SocketException: Connection reset

snip



Re: Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
Tomás,


Thanks for the response -- the performance of my query makes perfect sense in 
light of your information.
I looked at Interval faceting.  My required interval is 1 day.  I cannot change 
that requirement.  Unless I am mis-reading the doc, that means to facet a 10 
year range, the query needs to specify over 3,600 intervals ??

f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]&f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]&etc,etc
 

Each query would be 185MB in size if I structure it this way.

I assume I must be mis-understanding how to use Interval faceting with dates.  
Are there any concrete examples you know of?  A google search did not come up 
with much.

Kind regards,
Dave

 On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:
   

 Range Faceting won't use the DocValues even if they are there set, it
translates each gap to a filter. This means that it will end up using the
FilterCache, which should cause faster followup queries if you repeat the
same gaps (and don't commit).
You may also want to try interval faceting, it will use DocValues instead
of filters. The API is different, you'll have to provide the intervals
yourself.

Tomás

On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 1/13/2015 10:35 AM, David Smith wrote:
  I have a query against a single 50M doc index (175GB) using Solr 4.10.2,
 that exhibits the following response times (via the debugQuery option in
 Solr Admin):
  process: {
   time: 24709,
   query: { time: 54 }, facet: { time: 24574 },
 
 
  The query time of 54ms is great and exactly as expected -- this example
 was a single-term search that returned 3 hits.
  I am trying to get the facet time (24.5 seconds) to be sub-second, and
 am having no luck.  The facet part of the query is as follows:
 
  params: { facet.range: eventDate,
   f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
   f.eventDate.facet.range.gap: +1DAY,
   start: 0,
 
   rows: 10,
 
   f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
 
   f.eventDate.facet.mincount: 1,
 
   facet: true,
 
   debugQuery: true,
   _: 1421169383802
   }
 
  And, the relevant schema definition is as follows:
 
     <field name="eventDate" type="tdate" indexed="true" stored="true"
            multiValued="false" docValues="true"/>

     <!-- A Trie based date field for faster date range queries and date
          faceting. -->
     <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
            positionIncrementGap="0"/>
 
 
  During the 25-second query, the Solr JVM pegs one CPU, with little or no
 I/O activity detected on the drive that holds the 175GB index.  I have 48GB
 of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
 
  I do NOT have any fieldValue caches configured as yet, because my
 (perhaps too simplistic?) reading of the documentation was that DocValues
 eliminates the need for a field-level cache on this facet field.

 24GB of RAM to cache 175GB is probably not enough in the general case,
 but if you're seeing very little disk I/O activity for this query, then
 we'll leave that alone and you can worry about it later.

 What I would try immediately is setting the facet.method parameter to
 enum and seeing what that does to the facet time.  I've had good luck
 generally with that, even in situations where the docs indicated that
 the default (fc) was supposed to work better.  I have never explored the
 relationship between facet.method and docValues, though.

 I'm out of ideas after this.  I don't have enough experience with
 faceting to help much.

 Thanks,
 Shawn



   

Re: Slow faceting performance on a docValues field

2015-01-13 Thread Tomás Fernández Löbbe
No, you are not misreading, right now there is no automatic way of
generating the intervals on the server side similar to range faceting... I
guess it won't work in your case. Maybe you should create a Jira to add
this feature to interval faceting.

Tomás

On Tue, Jan 13, 2015 at 10:44 AM, David Smith dsmiths...@yahoo.com.invalid
wrote:

 Tomás,


 Thanks for the response -- the performance of my query makes perfect sense
 in light of your information.
 I looked at Interval faceting.  My required interval is 1 day.  I cannot
 change that requirement.  Unless I am mis-reading the doc, that means to
 facet a 10 year range, the query needs to specify over 3,600 intervals ??


 f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]etc,etc


 Each query would be 185MB in size if I structure it this way.

 I assume I must be mis-understanding how to use Interval faceting with
 dates.  Are there any concrete examples you know of?  A google search did
 not come up with much.

 Kind regards,
 Dave

  On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:


  Range Faceting won't use the DocValues even if they are there set, it
 translates each gap to a filter. This means that it will end up using the
 FilterCache, which should cause faster followup queries if you repeat the
 same gaps (and don't commit).
 You may also want to try interval faceting, it will use DocValues instead
 of filters. The API is different, you'll have to provide the intervals
 yourself.

 Tomás

 On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org
 wrote:

  On 1/13/2015 10:35 AM, David Smith wrote:
   I have a query against a single 50M doc index (175GB) using Solr
 4.10.2,
  that exhibits the following response times (via the debugQuery option in
  Solr Admin):
   process: {
time: 24709,
query: { time: 54 }, facet: { time: 24574 },
  
  
   The query time of 54ms is great and exactly as expected -- this example
  was a single-term search that returned 3 hits.
   I am trying to get the facet time (24.5 seconds) to be sub-second, and
  am having no luck.  The facet part of the query is as follows:
  
   params: { facet.range: eventDate,
f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
f.eventDate.facet.range.gap: +1DAY,
start: 0,
  
rows: 10,
  
f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
  
f.eventDate.facet.mincount: 1,
  
facet: true,
  
debugQuery: true,
_: 1421169383802
}
  
   And, the relevant schema definition is as follows:
  
   <field name="eventDate" type="tdate" indexed="true" stored="true"
          multiValued="false" docValues="true"/>

   <!-- A Trie based date field for faster date range queries and date
        faceting. -->
   <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
          positionIncrementGap="0"/>
  
  
   During the 25-second query, the Solr JVM pegs one CPU, with little or
 no
  I/O activity detected on the drive that holds the 175GB index.  I have
 48GB
  of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
  
   I do NOT have any fieldValue caches configured as yet, because my
  (perhaps too simplistic?) reading of the documentation was that DocValues
  eliminates the need for a field-level cache on this facet field.
 
  24GB of RAM to cache 175GB is probably not enough in the general case,
  but if you're seeing very little disk I/O activity for this query, then
  we'll leave that alone and you can worry about it later.
 
  What I would try immediately is setting the facet.method parameter to
  enum and seeing what that does to the facet time.  I've had good luck
  generally with that, even in situations where the docs indicated that
  the default (fc) was supposed to work better.  I have never explored the
  relationship between facet.method and docValues, though.
 
  I'm out of ideas after this.  I don't have enough experience with
  faceting to help much.
 
  Thanks,
  Shawn
 
 





Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Chris Hostetter

: assuming I've written the subclass of the postings format, I need to tell
: Solr to use it.
: 
: Do I just do something like:
: 
: <fieldType name="ocr" class="solr.TextField" postingsFormat="MySubclass"/>

the postingsFormat attribute in schema.xml just refers to the name of the
PostingsFormat in SPI -- which is discussed in the PostingsFormat
javadocs...

https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/PostingsFormat.html

...the nuts & bolts of it is that the PostingsFormat base class should take
care of all the SPI name registration that you need, based on what you
pass to the super() constructor ... although now that I think about it,
I'm not sure how you'd go about specifying your own name for the
PostingsFormat when also doing something like subclassing
Lucene41PostingsFormat ... there's no Lucene41PostingsFormat constructor
you can call from your subclass to override the name.

I'm not sure what the expectation is there in the Java API.
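
A minimal sketch of one way around that naming issue, offered as an untested
outline: register a PostingsFormat under your own SPI name and delegate to a
Lucene41PostingsFormat constructed with the block sizes you want (class name
and the block-size values are illustrative):

import java.io.IOException;
import org.apache.lucene.codecs.FieldsConsumer;
import org.apache.lucene.codecs.FieldsProducer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Registered via SPI (META-INF/services/org.apache.lucene.codecs.PostingsFormat)
// under the name referenced in schema.xml: postingsFormat="MySubclass"
public class MySubclassPostingsFormat extends PostingsFormat {

  // delegate configured with custom min/max term block sizes
  private final PostingsFormat delegate = new Lucene41PostingsFormat(50, 100);

  public MySubclassPostingsFormat() {
    super("MySubclass");
  }

  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    return delegate.fieldsConsumer(state);
  }

  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
    return delegate.fieldsProducer(state);
  }
}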


: Is there a way to set this for all fieldtypes or would that require writing
: a custom CodecFactory?

SchemaCodecFactory uses the Lucene default for any fieldType that doesn't
define its own postingsFormat -- so if you wanted to change the
postingsFormat for *every* fieldType, then yes: you'd need to override the
CodecFactory itself.



-Hoss
http://www.lucidworks.com/


Re: Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
What is stumping me is that the search result has 3 hits, yet faceting those 3 
hits takes 24 seconds.  The documentation for facet.method=fc is quite explicit 
about how Solr does faceting:


fc (stands for Field Cache) The facet counts are calculated by iterating over 
documents that match the query and summing the terms that appear in each 
document. This was the default method for single valued fields prior to Solr 
1.4.

If a search yielded millions of hits, I could understand 24 seconds to 
calculate the facets.  But not for a search with only 3 hits.  


What am I missing?  

Regards,
David



 

 On Tuesday, January 13, 2015 1:12 PM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:
   

 No, you are not misreading, right now there is no automatic way of
generating the intervals on the server side similar to range faceting... I
guess it won't work in your case. Maybe you should create a Jira to add
this feature to interval faceting.

Tomás

On Tue, Jan 13, 2015 at 10:44 AM, David Smith dsmiths...@yahoo.com.invalid
wrote:

 Tomás,


 Thanks for the response -- the performance of my query makes perfect sense
 in light of your information.
 I looked at Interval faceting.  My required interval is 1 day.  I cannot
 change that requirement.  Unless I am mis-reading the doc, that means to
 facet a 10 year range, the query needs to specify over 3,600 intervals ??


 f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]etc,etc


 Each query would be 185MB in size if I structure it this way.

 I assume I must be mis-understanding how to use Interval faceting with
 dates.  Are there any concrete examples you know of?  A google search did
 not come up with much.

 Kind regards,
 Dave

      On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:


  Range Faceting won't use the DocValues even if they are there set, it
 translates each gap to a filter. This means that it will end up using the
 FilterCache, which should cause faster followup queries if you repeat the
 same gaps (and don't commit).
 You may also want to try interval faceting, it will use DocValues instead
 of filters. The API is different, you'll have to provide the intervals
 yourself.

 Tomás

 On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org
 wrote:

  On 1/13/2015 10:35 AM, David Smith wrote:
   I have a query against a single 50M doc index (175GB) using Solr
 4.10.2,
  that exhibits the following response times (via the debugQuery option in
  Solr Admin):
   process: {
    time: 24709,
    query: { time: 54 }, facet: { time: 24574 },
  
  
   The query time of 54ms is great and exactly as expected -- this example
  was a single-term search that returned 3 hits.
   I am trying to get the facet time (24.5 seconds) to be sub-second, and
  am having no luck.  The facet part of the query is as follows:
  
   params: { facet.range: eventDate,
    f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
    f.eventDate.facet.range.gap: +1DAY,
    start: 0,
  
    rows: 10,
  
    f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
  
    f.eventDate.facet.mincount: 1,
  
    facet: true,
  
    debugQuery: true,
    _: 1421169383802
    }
  
   And, the relevant schema definition is as follows:
  
      <field name="eventDate" type="tdate" indexed="true" stored="true"
             multiValued="false" docValues="true"/>

      <!-- A Trie based date field for faster date range queries and date
           faceting. -->
      <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
             positionIncrementGap="0"/>
  
  
   During the 25-second query, the Solr JVM pegs one CPU, with little or
 no
  I/O activity detected on the drive that holds the 175GB index.  I have
 48GB
  of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
  
   I do NOT have any fieldValue caches configured as yet, because my
  (perhaps too simplistic?) reading of the documentation was that DocValues
  eliminates the need for a field-level cache on this facet field.
 
  24GB of RAM to cache 175GB is probably not enough in the general case,
  but if you're seeing very little disk I/O activity for this query, then
  we'll leave that alone and you can worry about it later.
 
  What I would try immediately is setting the facet.method parameter to
  enum and seeing what that does to the facet time.  I've had good luck
  generally with that, even in situations where the docs indicated that
  the default (fc) was supposed to work better.  I have never explored the
  relationship between facet.method and docValues, though.
 
  I'm out of ideas after this.  I don't have enough experience with
  faceting to help much.
 
  Thanks,
  Shawn
 
 




   

Re: Slow faceting performance on a docValues field

2015-01-13 Thread Tomás Fernández Löbbe
fc, fcs and enum only apply for field faceting, not range faceting.

Tomás

On Tue, Jan 13, 2015 at 11:24 AM, David Smith dsmiths...@yahoo.com.invalid
wrote:

 What is stumping me is that the search result has 3 hits, yet faceting
 those 3 hits takes 24 seconds.  The documentation for facet.method=fc is
 quite explicit about how Solr does faceting:


 fc (stands for Field Cache) The facet counts are calculated by iterating
 over documents that match the query and summing the terms that appear in
 each document. This was the default method for single valued fields prior
 to Solr 1.4.

 If a search yielded millions of hits, I could understand 24 seconds to
 calculate the facets.  But not for a search with only 3 hits.


 What am I missing?

 Regards,
 David





  On Tuesday, January 13, 2015 1:12 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:


  No, you are not misreading, right now there is no automatic way of
 generating the intervals on the server side similar to range faceting... I
 guess it won't work in your case. Maybe you should create a Jira to add
 this feature to interval faceting.

 Tomás

 On Tue, Jan 13, 2015 at 10:44 AM, David Smith dsmiths...@yahoo.com.invalid
 
 wrote:

  Tomás,
 
 
  Thanks for the response -- the performance of my query makes perfect
 sense
  in light of your information.
  I looked at Interval faceting.  My required interval is 1 day.  I cannot
  change that requirement.  Unless I am mis-reading the doc, that means to
  facet a 10 year range, the query needs to specify over 3,600 intervals ??
 
 
 
 f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]etc,etc
 
 
  Each query would be 185MB in size if I structure it this way.
 
  I assume I must be mis-understanding how to use Interval faceting with
  dates.  Are there any concrete examples you know of?  A google search did
  not come up with much.
 
  Kind regards,
  Dave
 
   On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
  tomasflo...@gmail.com wrote:
 
 
   Range Faceting won't use the DocValues even if they are there set, it
  translates each gap to a filter. This means that it will end up using the
  FilterCache, which should cause faster followup queries if you repeat the
  same gaps (and don't commit).
  You may also want to try interval faceting, it will use DocValues instead
  of filters. The API is different, you'll have to provide the intervals
  yourself.
 
  Tomás
 
  On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org
  wrote:
 
   On 1/13/2015 10:35 AM, David Smith wrote:
I have a query against a single 50M doc index (175GB) using Solr
  4.10.2,
   that exhibits the following response times (via the debugQuery option
 in
   Solr Admin):
process: {
 time: 24709,
 query: { time: 54 }, facet: { time: 24574 },
   
   
The query time of 54ms is great and exactly as expected -- this
 example
   was a single-term search that returned 3 hits.
I am trying to get the facet time (24.5 seconds) to be sub-second,
 and
   am having no luck.  The facet part of the query is as follows:
   
params: { facet.range: eventDate,
 f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
 f.eventDate.facet.range.gap: +1DAY,
 start: 0,
   
 rows: 10,
   
 f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
   
 f.eventDate.facet.mincount: 1,
   
 facet: true,
   
 debugQuery: true,
 _: 1421169383802
 }
   
And, the relevant schema definition is as follows:
   
    <field name="eventDate" type="tdate" indexed="true" stored="true"
           multiValued="false" docValues="true"/>

    <!-- A Trie based date field for faster date range queries and date
         faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
           positionIncrementGap="0"/>
   
   
During the 25-second query, the Solr JVM pegs one CPU, with little or
  no
   I/O activity detected on the drive that holds the 175GB index.  I have
  48GB
   of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
   
I do NOT have any fieldValue caches configured as yet, because my
   (perhaps too simplistic?) reading of the documentation was that
 DocValues
   eliminates the need for a field-level cache on this facet field.
  
   24GB of RAM to cache 175GB is probably not enough in the general case,
   but if you're seeing very little disk I/O activity for this query, then
   we'll leave that alone and you can worry about it later.
  
   What I would try immediately is setting the facet.method parameter to
   enum and seeing what that does to the facet time.  I've had good luck
   generally with that, even in situations where the docs indicated that
   the default (fc) was supposed to work better.  I have never explored
 the
   relationship between facet.method and docValues, though.
  
   I'm out of ideas after this.  I 

Re: Solr grouping problem - need help

2015-01-13 Thread Erick Erickson
bq: My question is for indexed=false, stored=true field..what is optimized way
to get unique values in such field.

There isn't any. To do this you'll have to read the doc from disk,
it'll be decompressed
along the way and then the field is read. Note that this happens
automatically when
you call doc.getFieldValue or similar.

At the stored=true level, you're always talking about complete documents.
indexed=true is about putting the field data into efficient-access structures.
They're completely different beasts.

your original question was:
Please guide me how i can tell solr not to tokenize stored field to decide
unique groups..

Simply declare the field you care about as a string type in
schema.xml. Then use a copyField directive to copy the data to the
new field, and group on the new field.

There are examples in the schema.xml of string types and copyFields that
should help.
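
A rough sketch of what that looks like for the tenant_pool field discussed
below (the _str field name is illustrative):

<field name="tenant_pool" type="text" indexed="true" stored="true"/>
<field name="tenant_pool_str" type="string" indexed="true" stored="false"/>
<copyField source="tenant_pool" dest="tenant_pool_str"/>

and then group with group=true&group.field=tenant_pool_str.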

Best,
Erick

On Tue, Jan 13, 2015 at 9:00 AM, Naresh Yadav nyadav@gmail.com wrote:
 Erick, my schema is the same, no change in that.
 *Schema :*
 <field name="tenant_pool" type="text" stored="true"/>
 My guess is I had not mentioned indexed="true" or "false"... maybe the
 default indexed is true.

 My question is: for an indexed=false, stored=true field, what is the
 optimized way to get unique values in such a field?

 On Tue, Jan 13, 2015 at 10:07 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Something is very wrong here. Have you perhaps been changing your
 schema without re-indexing? And I recommend you completely remove
 your data directory (the one with index and tlog subdirectories) after
 you change your schema.xml file.

 Because you're trying to group on a field that is _not_ indexed, you
 should be getting an error returned, something like:
 can not use FieldCache on a field which is neither indexed nor has
 doc values: 

  As far as the tokenization comment goes, just start by making the field you
  want to group on
  stored="false" indexed="true" type="string"

 Best,
 Erick

 On Tue, Jan 13, 2015 at 5:09 AM, Naresh Yadav nyadav@gmail.com
 wrote:
  Hi jack,
 
   Thanks for replying, I am new to Solr, please guide me on this. I have
   many such columns in my schema,
   so a copy field will create a lot of duplicate fields; besides, I do not
   need any search on the original field.

   My use case is that I do not want any search on the tenant_pool field;
   that's why I declared it as a stored field, not indexed.
   I just need to get the unique values in this field. Please show some
   direction.
 
 
  On Tue, Jan 13, 2015 at 6:16 PM, Jack Krupansky 
 jack.krupan...@gmail.com
  wrote:
 
  That's your job. The easiest way is to do a copyField to a string
 field.
 
  -- Jack Krupansky
 
  On Tue, Jan 13, 2015 at 7:33 AM, Naresh Yadav nyadav@gmail.com
  wrote:
 
    *Schema :*
    <field name="tenant_pool" type="text" stored="true"/>

    *Code :*
    SolrQuery q = new SolrQuery().setQuery("*:*");
    q.set(GroupParams.GROUP, true);
    q.set(GroupParams.GROUP_FIELD, "tenant_pool");
  
   *Data :*
   tenant_pool : Baroda Farms
   tenant_pool : Ketty Farms
  
   *Output coming :*
   groupValue=Farms, docs=2
  
   *Expected Output :*
   groupValue=Baroda Farms, docs=1
   groupValue=Ketty Farms, docs=1
  
    Please guide me on how I can tell Solr not to tokenize the stored field
    when deciding unique groups.

    I want the unique groups to be the exact value of the field, not the
    tokens which Solr is using currently.
  
   Thanks
   Naresh
  
 
 
 
 
  --
  Cheers,
 
  Naresh Yadav
  +919960523401
  http://nareshyadav.blogspot.com/
  SSE, MetrixLine Inc.



RE: Distributed unit tests and SSL doesn't have a valid keystore

2015-01-13 Thread Markus Jelsma
Thanks, we will suppress it for now!
M. 
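
For the archives, a minimal sketch of suppressing SSL on such a test
(assuming the 4.10.x test framework annotation; the test class name is
illustrative):

import org.apache.solr.BaseDistributedSearchTestCase;
import org.apache.solr.SolrTestCaseJ4.SuppressSSL;

// Opt this distributed test out of the randomized SSL configuration
@SuppressSSL
public class MyDistributedTest extends BaseDistributedSearchTestCase {
  // ... test methods ...
}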
 
-Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Monday 12th January 2015 19:25
 To: solr-user@lucene.apache.org
 Subject: Re: Distributed unit tests and SSL doesn't have a valid keystore
 
 I'd have to do some digging. Hossman might know offhand. You might just
 want to use @SupressSSL on the tests :)
 
 - Mark
 
 On Mon Jan 12 2015 at 8:45:11 AM Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  Hi - in a small Maven project depending on Solr 4.10.3, running unit tests
  that extend BaseDistributedSearchTestCase randomly fail with SSL doesn't
  have a valid keystore, and a lot of zombie threads. We have a
  solrtest.keystore file laying around, but where to put it?
 
  Thanks,
  Markus
 
 


Re: Slow faceting performance on a docValues field

2015-01-13 Thread Alexandre Rafalovitch
Could probably write a custom SearchComponent to prepend and expand
the query for the required use case. Though if something then has to
parse that query back, it would still be an issue.

Regards,
 Alex

Sign up for my Solr resources newsletter at http://www.solr-start.com/
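
A rough sketch of the kind of component described above, against the Solr 4.x SearchComponent API (the class name and the facet.dailyintervals.* request parameters are invented for illustration; this is an untested sketch, not a finished implementation):

import java.io.IOException;
import java.time.LocalDate;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class DailyIntervalFacetComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams p = rb.req.getParams();
    String field = p.get("facet.dailyintervals.field");   // hypothetical trigger parameter
    if (field == null) return;

    ModifiableSolrParams mp = new ModifiableSolrParams(p);
    mp.set("facet", true);
    mp.add("facet.interval", field);
    LocalDate day = LocalDate.parse(p.get("facet.dailyintervals.start"));
    LocalDate end = LocalDate.parse(p.get("facet.dailyintervals.end"));
    while (!day.isAfter(end)) {
      // one [start,end) interval per day
      mp.add("f." + field + ".facet.interval.set",
          "[" + day + "T00:00:00Z," + day.plusDays(1) + "T00:00:00Z)");
      day = day.plusDays(1);
    }
    rb.req.setParams(mp);
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do here; the stock FacetComponent computes the intervals
  }

  @Override
  public String getDescription() {
    return "Expands a date range into per-day interval facets";
  }

  @Override
  public String getSource() {
    return null;
  }
}

The component would still need to be registered in solrconfig.xml and added to the handler's component list, and anything that later inspects the request sees the expanded parameter list, which is the caveat mentioned above.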


On 13 January 2015 at 14:12, Tomás Fernández Löbbe
tomasflo...@gmail.com wrote:
 No, you are not misreading, right now there is no automatic way of
 generating the intervals on the server side similar to range faceting... I
 guess it won't work in your case. Maybe you should create a Jira to add
 this feature to interval faceting.

 Tomás

 On Tue, Jan 13, 2015 at 10:44 AM, David Smith dsmiths...@yahoo.com.invalid
 wrote:

 Tomás,


 Thanks for the response -- the performance of my query makes perfect sense
 in light of your information.
 I looked at Interval faceting.  My required interval is 1 day.  I cannot
 change that requirement.  Unless I am mis-reading the doc, that means to
 facet a 10 year range, the query needs to specify over 3,600 intervals ??


 f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]etc,etc


 Each query would be 185MB in size if I structure it this way.

 I assume I must be mis-understanding how to use Interval faceting with
 dates.  Are there any concrete examples you know of?  A google search did
 not come up with much.

 Kind regards,
 Dave

  On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:


  Range Faceting won't use the DocValues even if they are there set, it
 translates each gap to a filter. This means that it will end up using the
 FilterCache, which should cause faster followup queries if you repeat the
 same gaps (and don't commit).
 You may also want to try interval faceting, it will use DocValues instead
 of filters. The API is different, you'll have to provide the intervals
 yourself.

 Tomás

 On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org
 wrote:

  On 1/13/2015 10:35 AM, David Smith wrote:
   I have a query against a single 50M doc index (175GB) using Solr
 4.10.2,
  that exhibits the following response times (via the debugQuery option in
  Solr Admin):
   process: {
time: 24709,
query: { time: 54 }, facet: { time: 24574 },
  
  
   The query time of 54ms is great and exactly as expected -- this example
  was a single-term search that returned 3 hits.
   I am trying to get the facet time (24.5 seconds) to be sub-second, and
  am having no luck.  The facet part of the query is as follows:
  
   params: { facet.range: eventDate,
f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
f.eventDate.facet.range.gap: +1DAY,
start: 0,
  
rows: 10,
  
f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
  
f.eventDate.facet.mincount: 1,
  
facet: true,
  
debugQuery: true,
_: 1421169383802
}
  
   And, the relevant schema definition is as follows:
  
  field name=eventDate type=tdate indexed=true stored=true
  multiValued=false docValues=true/
  
  !-- A Trie based date field for faster date range queries and date
  faceting. --
  fieldType name=tdate class=solr.TrieDateField precisionStep=6
  positionIncrementGap=0/
  
  
   During the 25-second query, the Solr JVM pegs one CPU, with little or
 no
  I/O activity detected on the drive that holds the 175GB index.  I have
 48GB
  of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
  
   I do NOT have any fieldValue caches configured as yet, because my
  (perhaps too simplistic?) reading of the documentation was that DocValues
  eliminates the need for a field-level cache on this facet field.
 
  24GB of RAM to cache 175GB is probably not enough in the general case,
  but if you're seeing very little disk I/O activity for this query, then
  we'll leave that alone and you can worry about it later.
 
  What I would try immediately is setting the facet.method parameter to
  enum and seeing what that does to the facet time.  I've had good luck
  generally with that, even in situations where the docs indicated that
  the default (fc) was supposed to work better.  I have never explored the
  relationship between facet.method and docValues, though.
 
  I'm out of ideas after this.  I don't have enough experience with
  faceting to help much.
 
  Thanks,
  Shawn
 
 





Re: Occasionally getting error in solr suggester component.

2015-01-13 Thread Dan Davis
Related question -

I see mention of needing to rebuild the spellcheck/suggest dictionary after
solr core reload.   I see spellcheckIndexDir in both the old wiki entry and
the solr reference guide
https://cwiki.apache.org/confluence/display/solr/Spell+Checking.  If this
parameter is provided, it sounds like the index is stored on the filesystem
and need not be rebuilt each time the core is reloaded.

Is this a correct understanding?
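
For reference, spellcheckIndexDir belongs to the index-based spellchecker configuration and points the dictionary at a directory on disk, roughly like this (the field name and path below are illustrative, not taken from a particular config):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">spell</str>
    <!-- persisted under the data directory; rebuilt only when explicitly requested -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>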


On Tue, Jan 13, 2015 at 2:17 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 I think you are probably getting bitten by one of the issues addressed in
 LUCENE-5889

 I would recommend against using buildOnCommit=true - with a large index
 this can be a performance-killer.  Instead, build the index yourself using
 the Solr spellchecker support (spellcheck.build=true)

 -Mike
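
For example, a one-off build request against the suggest handler configured in this thread might look like this (host and core names are placeholders):

curl "http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=haSuggester&suggest.build=true"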


 On 01/13/2015 10:41 AM, Dhanesh Radhakrishnan wrote:

 Hi all,

 I am experiencing a problem in Solr SuggestComponent
 Occasionally solr suggester component throws an  error like

 Solr failed:
 {responseHeader:{status:500,QTime:1},error:{msg:suggester was
 not built,trace:java.lang.IllegalStateException: suggester was not
 built\n\tat
 org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.
 lookup(AnalyzingInfixSuggester.java:368)\n\tat
 org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.
 lookup(AnalyzingInfixSuggester.java:342)\n\tat
 org.apache.lucene.search.suggest.Lookup.lookup(Lookup.java:240)\n\tat
 org.apache.solr.spelling.suggest.SolrSuggester.
 getSuggestions(SolrSuggester.java:199)\n\tat
 org.apache.solr.handler.component.SuggestComponent.
 process(SuggestComponent.java:234)\n\tat
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(
 SearchHandler.java:218)\n\tat
 org.apache.solr.handler.RequestHandlerBase.handleRequest(
 RequestHandlerBase.java:135)\n\tat
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.
 handleRequest(RequestHandlers.java:246)\n\tat
 org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.execute(
 SolrDispatchFilter.java:777)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(
 SolrDispatchFilter.java:418)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(
 SolrDispatchFilter.java:207)\n\tat
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
 ApplicationFilterChain.java:243)\n\tat
 org.apache.catalina.core.ApplicationFilterChain.doFilter(
 ApplicationFilterChain.java:210)\n\tat
 org.apache.catalina.core.StandardWrapperValve.invoke(
 StandardWrapperValve.java:225)\n\tat
 org.apache.catalina.core.StandardContextValve.invoke(
 StandardContextValve.java:123)\n\tat
 org.apache.catalina.core.StandardHostValve.invoke(
 StandardHostValve.java:168)\n\tat
 org.apache.catalina.valves.ErrorReportValve.invoke(
 ErrorReportValve.java:98)\n\tat
 org.apache.catalina.valves.AccessLogValve.invoke(
 AccessLogValve.java:927)\n\tat
 org.apache.catalina.valves.RemoteIpValve.invoke(
 RemoteIpValve.java:680)\n\tat
 org.apache.catalina.core.StandardEngineValve.invoke(
 StandardEngineValve.java:118)\n\tat
 org.apache.catalina.connector.CoyoteAdapter.service(
 CoyoteAdapter.java:407)\n\tat
 org.apache.coyote.http11.AbstractHttp11Processor.process(
 AbstractHttp11Processor.java:1002)\n\tat
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.
 process(AbstractProtocol.java:579)\n\tat
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.
 run(JIoEndpoint.java:312)\n\tat
 java.util.concurrent.ThreadPoolExecutor.runWorker(
 ThreadPoolExecutor.java:1145)\n\tat
 java.util.concurrent.ThreadPoolExecutor$Worker.run(
 ThreadPoolExecutor.java:615)\n\tat
 java.lang.Thread.run(Thread.java:745)\n,code:500}}

 This is not freequently happening, but idexing and suggestor component
 working togethere  this error will occur.




 In solr config

 searchComponent name=suggest class=solr.SuggestComponent
  lst name=suggester
str name=namehaSuggester/str
str name=lookupImplAnalyzingInfixLookupFactory/str  !--
 org.apache.solr.spelling.suggest.fst --
str name=suggestAnalyzerFieldTypetextSpell/str
str name=dictionaryImplDocumentDictionaryFactory/str
  !--
 org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory --
str name=fieldname/str
str name=weightFieldpackageWeight/str
str name=buildOnCommittrue/str
  /lst
/searchComponent

requestHandler name=/suggest class=solr.SearchHandler
 startup=lazy
  lst name=defaults
str name=suggesttrue/str
str name=suggest.count10/str
  /lst
  arr name=components
strsuggest/str
  /arr
/requestHandler

 Can any one suggest where to look to figure out this error and why these
 errors are occurring?



 Thanks,
 dhanesh s.r




 --





Re: Frequent deletions

2015-01-13 Thread Shawn Heisey
On 1/13/2015 12:10 AM, ig01 wrote:
 Unfortunately this is the case, we do have hundreds of millions of documents
 on one 
 Solr instance/server. All our configs and schema are with default
 configurations. Our index
 size is 180G, does that mean that we need at least 180G heap size?

If you have hundreds of millions of documents and the index is only
180GB, they must be REALLY tiny documents.

The number of documents has a lot more impact on the heap requirements
than the index size on disk.  As described in my previous email, I have
about 130GB of total index on my dev Solr server, and the heap is only
7GB.  Everything I ask that machine to do, which includes optimizing
shards that are up to 20GB each, works flawlessly.

When a Solr index has 500 million documents, the amount of memory
required to construct a single entry in the filterCache is over 60MB.
The size of the filterCache in the default example config is 512 ...
which means that if that cache ends up fully utilized, that's in the
neighborhood of 30GB of RAM required for just one Solr cache.  The
amount of memory required for the Lucene FieldCache could be insane with
500 million documents, depending on the exact nature of the queries that
you are doing.
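
As a rough back-of-the-envelope check (a filterCache entry is, in the worst case, a bitset with one bit per document):

500,000,000 docs / 8 bits per byte  ≈ 62.5 MB per filterCache entry
512 entries x 62.5 MB               ≈ 32 GB for a fully populated filterCache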

The index size on disk has a different tie to memory -- the RAM that is
not allocated to programs is automatically used by the operating system
for caching data on the disk.  If you have plenty of RAM so the OS disk
cache can effectively keep relevant parts of the index in memory,
performance will not suffer.  Anytime Solr must actually ask the disk
for index data, it will be slow.

With 120GB out of the 140GB total allocated to Solr, that leaves 20GB to
cache 180GB of index data.  That's almost certainly not enough.
Although the OS disk cache requirements have no direct correlation with
OOME exceptions, slow performance due to insufficient caching might lead
*indirectly* to OOME, because the slow performance means that it's more
likely you'll have many queries happening at the same time, which will
lead to larger heap requirements.

Thanks,
Shawn



Solr fails to start with log file not found error

2015-01-13 Thread Graeme Pietersz
I get this error when starting Solr using the script in bin/solr

tail cannot open `[path]/logs/solr.log’ for reading: No such file or directory

It does not happen every time, but it does happen a lot. It sometimes clears up 
after a while.

I have tried creating an empty file, but solr then just says:

Backing up [path]/logs/solr.log

And repeats the same error.

I am guessing the problem is that it cannot get the error from the log file 
because the log file has not been created yet, but then how do I debug this?

Running  Solr 4.10.2 on Debian 7 using Jetty with the default IcedTea 2.5.3 
java version 1.7.0_65

Thanks for any help or pointers.


Re: leader split-brain at least once a day - need help

2015-01-13 Thread Thomas Lamy

Hi Mark,

we're currently at 4.10.2; the update to 4.10.3 is scheduled for tomorrow.

T

Am 12.01.15 um 17:30 schrieb Mark Miller:

bq. ClusterState says we are the leader, but locally we don't think so

Generally this is due to some bug. One bug that can lead to it was recently
fixed in 4.10.3 I think. What version are you on?

- Mark

On Mon Jan 12 2015 at 7:35:47 AM Thomas Lamy t.l...@cytainment.de wrote:


Hi,

I found no big/unusual GC pauses in the Log (at least manually; I found
no free solution to analyze them that worked out of the box on a
headless debian wheezy box). Eventually I tried -Xmx8G (it was 64G
before) on one of the nodes, after checking that allocation after 1 hour of
run time was at about 2-3GB. That didn't move the time frame where a restart
was needed, so I don't think Solr's JVM GC is the problem.
We're trying to get all of our node's logs (zookeeper and solr) into
Splunk now, just to get a better sorted view of what's going on in the
cloud once a problem occurs. We're also enabling GC logging for
zookeeper; maybe we were missing problems there while focussing on solr
logs.

Thomas


Am 08.01.15 um 16:33 schrieb Yonik Seeley:

It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called split brain).
The async nature of updates (and thread scheduling) along with
stop-the-world GC pauses that can change leadership, cause these
little windows of inconsistencies that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de

wrote:

Hi there,

we are running a 3 server cloud serving a dozen
single-shard/replicate-everywhere collections. The 2 biggest

collections are

~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5,

Tomcat

7.0.56, Oracle Java 1.7.0_72-b14

10 of the 12 collections (the small ones) get filled by DIH full-import

once

a day starting at 1am. The second biggest collection is updated usind

DIH

delta-import every 10 minutes, the biggest one gets bulk json updates

with

commits once in 5 minutes.

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request

says it

is coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor;

ClusterState

says we are the leader, but locally we don't think so

One of these pop up once a day at around 8am, making either some cores

going

to recovery failed state, or all cores of at least one cloud node into
state gone.
This started out of the blue about 2 weeks ago, without changes to

neither

software, data, or client behaviour.

Most of the time, we get things going again by restarting solr on the
current leader node, forcing a new election - can this be triggered

while

keeping solr (and the caches) up?
But sometimes this doesn't help, we had an incident last weekend where

our

admins didn't restart in time, creating millions of entries in
/solr/oversser/queue, making zk close the connection, and leader

re-elect

fails. I had to flush zk, and re-upload collection config to get solr up
again (just like in https://gist.github.com/

isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections,

1500

requests/s) up and running, which does not have these problems since
upgrading to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG  Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



--
Thomas Lamy
Cytainment AG  Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476





--
Thomas Lamy
Cytainment AG  Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: Extending solr analysis in index time

2015-01-13 Thread Ali Nazemian
Dear Markus,

Unfortunately I cannot use payloads, since I want to return this score to
each user as a simple field alongside the other fields, and payloads do not
provide that. Also, I don't want to change the default similarity method of
Lucene; I just want to have this field to do the sorting in some cases.
Best regards.

On Mon, Jan 12, 2015 at 10:26 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi - You mention having a list with important terms, then using payloads
 would be the most straightforward i suppose. You still need a custom
 similarity and custom query parser. Payloads work for us very well.

 M



 -Original message-
  From:Ahmet Arslan iori...@yahoo.com.INVALID
  Sent: Monday 12th January 2015 19:50
  To: solr-user@lucene.apache.org
  Subject: Re: Extending solr analysis in index time
 
  Hi Ali,
 
  Reading your example, if you could somehow replace idf component with
 your importance weight,
  I think your use case looks like TFIDFSimilarity. Tf component remains
 same.
 
 
 https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
 
  I also suggest you ask this in lucene mailing list. Someone familiar
 with similarity package can give insight on this.
 
  Ahmet
 
 
 
  On Monday, January 12, 2015 6:54 PM, Jack Krupansky 
 jack.krupan...@gmail.com wrote:
  Could you clarify what you mean by Lucene reverse index? That's not a
  term I am familiar with.
 
  -- Jack Krupansky
 
 
  On Mon, Jan 12, 2015 at 1:01 AM, Ali Nazemian alinazem...@gmail.com
 wrote:
 
   Dear Jack,
   Thank you very much.
   Yeah I was thinking of function query for sorting, but I have to
 problems
   in this case, 1) function query do the process at query time which I
 dont
   want to. 2) I also want to have the score field for retrieving and
 showing
   to users.
  
   Dear Alexandre,
   Here is some more explanation about the business behind the question:
   I am going to provide a field for each document, lets refer it as
   document_score. I am going to fill this field based on the
 information
   that could be extracted from Lucene reverse index. Assume I have a
 list of
   terms, called important terms and I am going to extract the term
 frequency
   for each of the terms inside this list per each document. To be honest
 I
   want to use the term frequency for calculating document_score.
   document_score should be storable since I am going to retrieve this
 field
   for each document. I also want to do sorting on document_store in
 case of
   preferred by user.
   I hope I did convey my point.
   Best regards.
  
  
   On Mon, Jan 12, 2015 at 12:53 AM, Jack Krupansky 
 jack.krupan...@gmail.com
   
   wrote:
  
Won't function queries do the job at query time? You can add or
 multiply
the tf*idf score by a function of the term frequency of arbitrary
 terms,
using the tf, mul, and add functions.
   
See:
https://cwiki.apache.org/confluence/display/solr/Function+Queries
   
-- Jack Krupansky
   
On Sun, Jan 11, 2015 at 10:55 AM, Ali Nazemian 
 alinazem...@gmail.com
wrote:
   
 Dear Jack,
 Hi,
 I think you misunderstood my need. I dont want to change the
 default
 scoring behavior of Lucene (tf-idf) I just want to have another
 field
   to
do
 sorting for some specific queries (not all the search business),
   however
I
 am aware of Lucene payload.
 Thank you very much.

 On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky 
jack.krupan...@gmail.com
 wrote:

  You would do that with a custom similarity (scoring) class.
 That's an
  expert feature. In fact a SUPER-expert feature.
 
  Start by completely familiarizing yourself with how TF*IDF
   similarity
  already works:
 
 

   
  
 http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
 
  And to use your custom similarity class in Solr:
 
 

   
  
 https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity
 
 
  -- Jack Krupansky
 
  On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian 
 alinazem...@gmail.com
   
  wrote:
 
   Hi everybody,
  
   I am going to add some analysis to Solr at the index time.
 Here is
 what I
   am considering in my mind:
   Suppose I have two different fields for Solr schema, field a
 and
 field
   b. I am going to use the created reverse index in a way that
 some
 terms
   are considered as important ones and tell lucene to calculate a
   value
  based
   on these terms frequency per each document. For example let the
   word
   hello considered as important word with the weight of 2.0.
Suppose
  the
   term frequency for this word at field a is 3 and at field
 b is
   6
 for
   document 1. Therefor the score value would be 2*3+(2*6)^2. I
 want
   to
  

Re: Tokenizer or Filter ?

2015-01-13 Thread tomas.kalas
Thanks Jack for your advice. Can you please explain me little more, how it
works? From Apache Wiki it's not to clear for me. I can write some
javaScript code when i want filtering some data ? In this case i have
d1bla bla bla/d1 d2 bla bla bla /d2 d1bla bla bla /d1 and i want
filtering d2 bla bla bla /d2, But in other case i want filtering all
d1  /d1 then i suppose i used it at indexed data and filtering from
them? Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179173.html
Sent from the Solr - User mailing list archive at Nabble.com.


Getting error while indexing XML files on Hadoop

2015-01-13 Thread celebis


Hi to all from Istanbul, Turkey,

I can say that I'm a newbie in Solr & Hadoop,

I’m trying to index XML files (ipod_other.xml from lucidworks’ example
files, converted into sequence file format), using SolrXMLIngestMapper jars.
I’ve modified the schema.xml file by making the necessary additions for the
fields stated in the ipod_other.xml file.

*Here’s my command:*
hadoop jar jobjar com.lucidworks.hadoop.ingest.IngestJob
-Dlww.commit.on.close=true -cls
com.lucidworks.hadoop.ingest.SolrXMLIngestMapper -c hdp1  -i
/user/hadoop/output/1420812982906sfu/part-r-0 -of
com.lucidworks.hadoop.io.LWMapRedOutputFormat -s
http://dc2vmhadappt01:8983/solr


In the end I constantly get a "Didn't ingest any documents, failing" error.

Anybody out there able to help me with this problem? Any help is
appreciated.

Thanks

*Here are the additions to the schema.xml:*

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" multiValued="true" stored="true" type="text_en" indexed="true"/>
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
<field name="inStock" type="boolean" indexed="true" stored="true" />

<field name="store" type="location" indexed="true" stored="true"/>

<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>

<field name="data_source" stored="false" type="text_en" indexed="true"/>


*And here is the ipod_other.xml file:*

<add>

<doc>
  <field name="id">F8V7067-APL-KIT</field>
  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter, white</field>
  <field name="weight">4</field>
  <field name="price">19.95</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>

  <field name="store">45.17614,-93.87341</field>
  <field name="manufacturedate_dt">2005-08-01T16:30:25Z</field>
</doc>

<doc>
  <field name="id">IW-02</field>
  <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter for iPod, white</field>
  <field name="weight">2</field>
  <field name="price">11.50</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>

  <field name="store">37.7752,-122.4232</field>
  <field name="manufacturedate_dt">2006-02-14T23:55:59Z</field>
</doc>


</add>






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-error-while-indexing-XML-files-on-Hadoop-tp4179168.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers

2015-01-13 Thread Daniel Collins
Is it important where your leader is?  If you just want to minimize
leadership changes during rolling re-start, then you could restart in the
opposite order (S3, S2, S1).  That would give only 1 transition, but the
end result would be a leader on S2 instead of S1 (not sure if that
important to you or not).  I know its not a fix, but it might be a
workaround until the whole leadership moving is done?

On 12 January 2015 at 18:17, Erick Erickson erickerick...@gmail.com wrote:

 Just skimming, but the problem here that I ran into was with the
 listeners. Each _Solr_ instance out there is listening to one of the
 ephemeral nodes (the one in front). So deleting a node does _not_
 change which ephemeral node the associated Solr instance is listening
 to.

 So, for instance, when you delete S2..n-01 and re-add it, S2 is
 still looking at S1n-00 and will continue looking at
 S1...n-00 until S1n-00 is deleted.

 Deleting S2..n-01 will wake up S3 though, which should now be
 looking at S1n-000. Now you have two Solr listeners looking at
 the same ephemeral node. The key is that deleting S2...n-01 does
 _not_ wake up S2, just any solr instance that has a watch on the
 associated ephemeral node.

 The code you want is in LeaderElector.checkIfIamLeader to understand
 how it all works. Be aware that the sortSeqs call sorts the nodes by
 1 sequence number
 2 string comparison.

 Which has the unfortunate characteristic of a secondary sort by
 session ID. So two nodes with the same sequence number can sort before
 or after each other depending on which one gets a session higher/lower
 than the other.

 This is quite tricky to get right, I once created a patch for 4.10.3
 by applying things in this order (some minor tweaks required). All
 SOLR-
 6115
 6512
 6577
 6513
 6517
 6670
 6691

 Good luck!
 Erick




 On Mon, Jan 12, 2015 at 8:54 AM, Zisis Tachtsidis zist...@runbox.com
 wrote:
  SolrCloud uses ZooKeeper sequence flags to keep track of the order in
 which
  nodes register themselves as leader candidates. The node with the lowest
  sequence number wins as leader of the shard.
 
  What I'm trying to do is to keep the leader re-assignments to the minimum
  during a rolling restart. In this direction I change the zk sequence
 numbers
  on the SolrCloud nodes when all nodes of the cluster are up and active.
 I'm
  using Solr 4.10.0 and I'm aware of SOLR-6491 which has a similar purpose
 but
  I'm trying to do it from outside, using the existing APIs without
 editing
  Solr source code.
 
  == TYPICAL SCENARIO ==
  Suppose we have 3 Solr instances S1,S2,S3. They are started in the same
  order and the zk sequences assigned have as follows
  S1:-n_00 (LEADER)
  S2:-n_01
  S3:-n_02
 
  In a rolling restart we'll get S2 as leader (after S1 shutdown), then S3
  (after S2 shutdown) and finally S1(after S3 shutdown), 3 changes in
 total.
 
  == MY ATTEMPT ==
  By using SolrZkClient and the Zookeeper multi API  I found a way to get
 rid
  of the old zknodes that participate in a shard's leader election and
 write
  new ones where we can assign the sequence number of our liking.
 
  S1:-n_00 (no code running here)
  S2:-n_04 (code deleting zknode -n_01 and creating
  -n_04)
  S3:-n_03 (code deleting zknode -n_02 and creating
  -n_03)
 
  In a rolling restart I'd expect to have S3 as leader (after S1
 shutdown), no
  change (after S2 shutdown) and finally S1(after S3 shutdown), that is 2
  changes. This will be constant no matter how many servers are added in
  SolrCloud while in the first scenarion the # of re-assignments equals
 the #
  of Solr servers.
 
  The problem occurs when S1 (LEADER) is shut down. The elections that take
  place still set S2 as leader, It's like ignoring the new sequence
 numbers.
  When I go to /solr/#/~cloud?view=tree the new sequence numbers are listed
  under /collections based on which S3 should have become the leader.
  Do you have any idea why the new state is not acknowledged during the
  elections? Is something cached? Or to put it bluntly do I have any chance
  down this path? If not what are my options? Is it possible to apply all
  patches under SOLR-6491 in isolation and continue from there?
 
  Thank you.
 
  Extra info which might help follows
  1. Some logging related to leader elections after S1 has been shut down
  S2 - org.apache.solr.cloud.SyncStrategy Leader's attempt to sync with
  shard failed, moving to the next candidate
  S2 - org.apache.solr.cloud.ShardLeaderElectionContext We failed sync,
  but we have no versions - we can't sync in that
 case - we were active before, so become leader anyway
 
  S3 - org.apache.solr.cloud.LeaderElector Our node is no longer in
 line
  to be leader
 
  2. And some sample code on how I perform the ZK re-sequencing
 // Read current zk nodes for a specific collection
 
 
 

Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers

2015-01-13 Thread Zisis Tachtsidis
Daniel Collins wrote
 Is it important where your leader is?  If you just want to minimize
 leadership changes during rolling re-start, then you could restart in the
 opposite order (S3, S2, S1).  That would give only 1 transition, but the
 end result would be a leader on S2 instead of S1 (not sure if that
 important to you or not).  I know its not a fix, but it might be a
 workaround until the whole leadership moving is done?

I think that rolling-restarting the machines in the opposite order
(S3,S2,S1) will result in S3 being the leader. It's a valid approach, but
wouldn't I then have to revert to the original order (S1,S2,S3) to achieve the
same result in the following rolling restart? That adds operational cost and
complexity that I want to avoid.


Erick Erickson wrote
 Just skimming, but the problem here that I ran into was with the
 listeners. Each _Solr_ instance out there is listening to one of the
 ephemeral nodes (the one in front). So deleting a node does _not_
 change which ephemeral node the associated Solr instance is listening
 to.

 So, for instance, when you delete S2..n-01 and re-add it, S2 is
 still looking at S1n-00 and will continue looking at
 S1...n-00 until S1n-00 is deleted.

 Deleting S2..n-01 will wake up S3 though, which should now be
 looking at S1n-000. Now you have two Solr listeners looking at
 the same ephemeral node. The key is that deleting S2...n-01 does
 _not_ wake up S2, just any solr instance that has a watch on the
 associated ephemeral node.

Thanks for the info Erick. I wasn't aware of this linked-list listeners
structure between the zk nodes. Based on what you've said though I've
changed my implementation a bit and it seems to be working at first glance.
Of course it's not reliable yet but it looks promising.

My original attempt
 S1:-n_00 (no code running here)
 S2:-n_04 (code deleting zknode -n_01 and creating
 -n_04)
 S3:-n_03 (code deleting zknode -n_02 and creating
 -n_03) 

has been changed to 
S1:-n_00 (no code running here)
S2:-n_03 (code deleting zknode -n_01 and creating
-n_03 using EPHEMERAL_SEQUENTIAL)
S3:-n_02 (no code running here) 

Once S1 is shutdown S3 becomes leader since it listens to S1 now according
to what you've said

The original reason I pursued this "minimize leadership changes" quest is
that leadership changes _could_ lead to data loss in some scenarios. I'm not
entirely sure about that, and you can correct me on this, but let me explain.

If there are incoming indexing requests during a rolling restart, could there
be a case where, during the current leader's shutdown, the leader-to-be node
does not have time to sync with the leader that is shutting down, in which
case every replica will now sync to the new leader and miss some updates?
I've seen an installation where the replicas had different index sizes, and
the difference deteriorated over time.
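
For reference, the delete-and-recreate step described above can be done atomically with the plain ZooKeeper multi API, roughly like this (paths, data and error handling are placeholders; this is only an illustrative sketch, not the code actually used here):

import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// zk is an already-connected ZooKeeper client
void resequence(ZooKeeper zk, String electionPath, String oldNode, byte[] data) throws Exception {
  zk.multi(Arrays.asList(
      // remove the old election node (any version) ...
      Op.delete(electionPath + "/" + oldNode, -1),
      // ... and register a new ephemeral-sequential node, which receives the next sequence number
      Op.create(electionPath + "/n_", data,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)));
}

Note that an ephemeral node created this way is tied to the session of the client that created it, which is one reason re-sequencing election nodes from outside Solr is fragile.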




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-shard-leader-elections-Altering-zookeeper-sequence-numbers-tp4178973p4179147.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr grouping problem - need help

2015-01-13 Thread Naresh Yadav
*Schema :*
<field name="tenant_pool" type="text" stored="true"/>

*Code :*
SolrQuery q = new SolrQuery().setQuery(*:*);
q.set(GroupParams.GROUP, true);
q.set(GroupParams.GROUP_FIELD, tenant_pool);

*Data :*
tenant_pool : Baroda Farms
tenant_pool : Ketty Farms

*Output coming :*
groupValue=Farms, docs=2

*Expected Output :*
groupValue=Baroda Farms, docs=1
groupValue=Ketty Farms, docs=1

Please guide me how i can tell solr not to tokenize stored field to decide
unique groups..

I want unique groups as exact value of field not the tokens which solr is
doing
currently.

Thanks
Naresh


Re: Solr startup script in version 4.10.3

2015-01-13 Thread Dominique Bejean
Thank you for your responses.

However, according to my tests, solr 4.10.3 doesn’t use server by default
anymore due to the removal of these lines in the bin/solr script.

# TODO: see SOLR-3619, need to support server or example
# depending on the version of Solr
if [ -e $SOLR_TIP/server/start.jar ]; then
   DEFAULT_SERVER_DIR=$SOLR_TIP/server
else
   DEFAULT_SERVER_DIR=$SOLR_TIP/example
fi


Solr 5.0.0 does, in both standalone and SolrCloud modes! This is great!

Dominique
http://www.eolya.fr/


Le jeudi 8 janvier 2015, Anshum Gupta ans...@anshumgupta.net a écrit :

 Things have changed reasonably for the 5.0 release.
 In case of a standalone mode, it still defaults to the server directory. So
 you'd find your logs in server/logs.
 In case of solrcloud mode e.g. if you ran

 bin/solr -e cloud -noprompt

 this would default to stuff being copied into example directory (leaving
 server directory untouched) and everything would run from there.

 You will also have the option of just creating a new SOLR home and using
 that instead. See the following:


 https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

 The link above is for the upcoming Solr 5.0 and is still work in progress
 but should give you more information.
 Hope that helps.


 On Tue, Jan 6, 2015 at 1:29 AM, Dominique Bejean 
 dominique.bej...@eolya.fr javascript:;
 wrote:

  Hi,
 
  In release 4.10.3, the following lines were removed from solr starting
  script (bin/solr)
 
  # TODO: see SOLR-3619, need to support server or example
  # depending on the version of Solr
  if [ -e $SOLR_TIP/server/start.jar ]; then
DEFAULT_SERVER_DIR=$SOLR_TIP/server
  else
DEFAULT_SERVER_DIR=$SOLR_TIP/example
  fi
 
  However, the usage message always say
 
-d dir  Specify the Solr server directory; defaults to server
 
 
  Either the usage have to be fixed or the removed lines put back to the
  script.
 
  Personally, I like the default to server directory.
 
  My installation process in order to have a clean empty solr instance is
 to
  copy examples into server and remove directories like example-DIH,
  example-shemaless, multicore and solr/collection1
 
  Solr server (or node) can be started without the -d parameter.
 
  If this makes sense, a Jira issue could be open.
 
  Dominique
  http://www.eolya.fr/
 



 --
 Anshum Gupta
 http://about.me/anshumgupta



Re: Solr grouping problem - need help

2015-01-13 Thread Jack Krupansky
That's your job. The easiest way is to do a copyField to a string field.

-- Jack Krupansky

On Tue, Jan 13, 2015 at 7:33 AM, Naresh Yadav nyadav@gmail.com wrote:

 *Schema :*
 field name=tenant_pool type=text stored=true/

 *Code :*
 SolrQuery q = new SolrQuery().setQuery(*:*);
 q.set(GroupParams.GROUP, true);
 q.set(GroupParams.GROUP_FIELD, tenant_pool);

 *Data :*
 tenant_pool : Baroda Farms
 tenant_pool : Ketty Farms

 *Output coming :*
 groupValue=Farms, docs=2

 *Expected Output :*
 groupValue=Baroda Farms, docs=1
 groupValue=Ketty Farms, docs=1

 Please guide me how i can tell solr not to tokenize stored field to decide
 unique groups..

 I want unique groups as exact value of field not the tokens which solr is
 doing
 currently.

 Thanks
 Naresh



Re: Extending solr analysis in index time

2015-01-13 Thread Jack Krupansky
A function query or an update processor to create a separate field are
still your best options.

-- Jack Krupansky

On Tue, Jan 13, 2015 at 4:18 AM, Ali Nazemian alinazem...@gmail.com wrote:

 Dear Markus,

 Unfortunately I can not use payload since I want to retrieve this score to
 each user as a simple field alongside other fields. Unfortunately payload
 does not provide that. Also I dont want to change the default similarity
 method of Lucene, I just want to have this filed to do the sorting in some
 cases.
 Best regards.

 On Mon, Jan 12, 2015 at 10:26 PM, Markus Jelsma 
 markus.jel...@openindex.io
 wrote:

  Hi - You mention having a list with important terms, then using payloads
  would be the most straightforward i suppose. You still need a custom
  similarity and custom query parser. Payloads work for us very well.
 
  M
 
 
 
  -Original message-
   From:Ahmet Arslan iori...@yahoo.com.INVALID
   Sent: Monday 12th January 2015 19:50
   To: solr-user@lucene.apache.org
   Subject: Re: Extending solr analysis in index time
  
   Hi Ali,
  
   Reading your example, if you could somehow replace idf component with
  your importance weight,
   I think your use case looks like TFIDFSimilarity. Tf component remains
  same.
  
  
 
 https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
  
   I also suggest you ask this in lucene mailing list. Someone familiar
  with similarity package can give insight on this.
  
   Ahmet
  
  
  
   On Monday, January 12, 2015 6:54 PM, Jack Krupansky 
  jack.krupan...@gmail.com wrote:
   Could you clarify what you mean by Lucene reverse index? That's not a
   term I am familiar with.
  
   -- Jack Krupansky
  
  
   On Mon, Jan 12, 2015 at 1:01 AM, Ali Nazemian alinazem...@gmail.com
  wrote:
  
Dear Jack,
Thank you very much.
Yeah I was thinking of function query for sorting, but I have to
  problems
in this case, 1) function query do the process at query time which I
  dont
want to. 2) I also want to have the score field for retrieving and
  showing
to users.
   
Dear Alexandre,
Here is some more explanation about the business behind the question:
I am going to provide a field for each document, lets refer it as
document_score. I am going to fill this field based on the
  information
that could be extracted from Lucene reverse index. Assume I have a
  list of
terms, called important terms and I am going to extract the term
  frequency
for each of the terms inside this list per each document. To be
 honest
  I
want to use the term frequency for calculating document_score.
document_score should be storable since I am going to retrieve this
  field
for each document. I also want to do sorting on document_store in
  case of
preferred by user.
I hope I did convey my point.
Best regards.
   
   
On Mon, Jan 12, 2015 at 12:53 AM, Jack Krupansky 
  jack.krupan...@gmail.com

wrote:
   
 Won't function queries do the job at query time? You can add or
  multiply
 the tf*idf score by a function of the term frequency of arbitrary
  terms,
 using the tf, mul, and add functions.

 See:
 https://cwiki.apache.org/confluence/display/solr/Function+Queries

 -- Jack Krupansky

 On Sun, Jan 11, 2015 at 10:55 AM, Ali Nazemian 
  alinazem...@gmail.com
 wrote:

  Dear Jack,
  Hi,
  I think you misunderstood my need. I dont want to change the
  default
  scoring behavior of Lucene (tf-idf) I just want to have another
  field
to
 do
  sorting for some specific queries (not all the search business),
however
 I
  am aware of Lucene payload.
  Thank you very much.
 
  On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky 
 jack.krupan...@gmail.com
  wrote:
 
   You would do that with a custom similarity (scoring) class.
  That's an
   expert feature. In fact a SUPER-expert feature.
  
   Start by completely familiarizing yourself with how TF*IDF
similarity
   already works:
  
  
 

   
 
 http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
  
   And to use your custom similarity class in Solr:
  
  
 

   
 
 https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity
  
  
   -- Jack Krupansky
  
   On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian 
  alinazem...@gmail.com

   wrote:
  
Hi everybody,
   
I am going to add some analysis to Solr at the index time.
  Here is
  what I
am considering in my mind:
Suppose I have two different fields for Solr schema, field
 a
  and
  field
b. I am going to use the created reverse index in a way
 that
  some
  terms
are considered as important ones and tell lucene to
 

Re: Solr grouping problem - need help

2015-01-13 Thread Naresh Yadav
Hi Jack,

Thanks for replying. I am new to Solr, so please guide me on this. I have many
such columns in my schema, so a copyField will create a lot of duplicate
fields; besides, I do not need any search on the original field.

My use case is that I do not want any search on the tenant_pool field; that's
why I declared it as a stored field, not an indexed one.
I just need to get the unique values in this field. Please show some direction.


On Tue, Jan 13, 2015 at 6:16 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 That's your job. The easiest way is to do a copyField to a string field.

 -- Jack Krupansky

 On Tue, Jan 13, 2015 at 7:33 AM, Naresh Yadav nyadav@gmail.com
 wrote:

  *Schema :*
  field name=tenant_pool type=text stored=true/
 
  *Code :*
  SolrQuery q = new SolrQuery().setQuery(*:*);
  q.set(GroupParams.GROUP, true);
  q.set(GroupParams.GROUP_FIELD, tenant_pool);
 
  *Data :*
  tenant_pool : Baroda Farms
  tenant_pool : Ketty Farms
 
  *Output coming :*
  groupValue=Farms, docs=2
 
  *Expected Output :*
  groupValue=Baroda Farms, docs=1
  groupValue=Ketty Farms, docs=1
 
  Please guide me how i can tell solr not to tokenize stored field to
 decide
  unique groups..
 
  I want unique groups as exact value of field not the tokens which solr is
  doing
  currently.
 
  Thanks
  Naresh
 




-- 
Cheers,

Naresh Yadav
+919960523401
http://nareshyadav.blogspot.com/
SSE, MetrixLine Inc.


Occasionally getting error in solr suggester component.

2015-01-13 Thread Dhanesh Radhakrishnan
Hi all,

I am experiencing a problem in Solr SuggestComponent
Occasionally solr suggester component throws an  error like

Solr failed:
{responseHeader:{status:500,QTime:1},error:{msg:suggester was
not built,trace:java.lang.IllegalStateException: suggester was not
built\n\tat
org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.lookup(AnalyzingInfixSuggester.java:368)\n\tat
org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.lookup(AnalyzingInfixSuggester.java:342)\n\tat
org.apache.lucene.search.suggest.Lookup.lookup(Lookup.java:240)\n\tat
org.apache.solr.spelling.suggest.SolrSuggester.getSuggestions(SolrSuggester.java:199)\n\tat
org.apache.solr.handler.component.SuggestComponent.process(SuggestComponent.java:234)\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)\n\tat
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)\n\tat
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)\n\tat
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)\n\tat
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)\n\tat
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)\n\tat
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)\n\tat
org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:680)\n\tat
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)\n\tat
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)\n\tat
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1002)\n\tat
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)\n\tat
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
java.lang.Thread.run(Thread.java:745)\n,code:500}}

This does not happen frequently, but when indexing and the suggester component
are working together, this error can occur.




In solr config

<searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">haSuggester</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>      <!-- org.apache.solr.spelling.suggest.fst -->
      <str name="suggestAnalyzerFieldType">textSpell</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>    <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
      <str name="field">name</str>
      <str name="weightField">packageWeight</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="suggest">true</str>
      <str name="suggest.count">10</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

Can anyone suggest where to look to figure out this error and why it occurs?



Thanks,
dhanesh s.r




--



Re: Solr fails to start with log file not found error

2015-01-13 Thread Erick Erickson
By any chance are you trying to start Solr as a different user when
this happens? I'm
wondering if there's a permissions issue here

Wild guess.

On Tue, Jan 13, 2015 at 12:37 AM, Graeme Pietersz gra...@pietersz.net wrote:
 I get this error when starting Solr using the script in bin/solr

 tail cannot open `[path]/logs/solr.log’ for reading: No such file or directory

 It does not happen every time, but it does happen a lot. It sometimes clears 
 up after a while.

 I have tried creating an empty file, but solr then just says:

 Backing up [path]/logs/solr.log

 And repeats the same error.

 I am guessing the problem is that it cannot get the error from the log file 
 because the log file has not been created yet, but then how do I debug this?

 Running  Solr 4.10.2 on Debian 7 using Jetty with the default IcedTea 2.5.3 
 java version 1.7.0_65

 Thanks for any help or pointers.


Re: Extending solr analysis in index time

2015-01-13 Thread Ali Nazemian
I decided to go for a function query, implementing one that reads the term
frequency for each document from the index. However, I did not find any
tutorial that matches my problem well. I would really appreciate it if
somebody could provide a useful tutorial or example for this case.
Thank you very much.
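
For what it's worth, Solr's built-in tf() function can be combined with mul, add and pow, both as a pseudo-field in fl and as a sort, along the lines of the weighted-sum example from earlier in the thread (the field names a and b and the term hello are placeholders only; parameters shown unencoded):

fl=*,doc_score:add(mul(2,tf(a,hello)),pow(mul(2,tf(b,hello)),2))
sort=add(mul(2,tf(a,hello)),pow(mul(2,tf(b,hello)),2)) desc

This is still computed at query time, though; persisting the value as a real stored field would need an update processor, as suggested in the reply quoted below.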

On Tue, Jan 13, 2015 at 4:21 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 A function query or an update processor to create a separate field are
 still your best options.

 -- Jack Krupansky

 On Tue, Jan 13, 2015 at 4:18 AM, Ali Nazemian alinazem...@gmail.com
 wrote:

  Dear Markus,
 
  Unfortunately I can not use payload since I want to retrieve this score
 to
  each user as a simple field alongside other fields. Unfortunately payload
  does not provide that. Also I dont want to change the default similarity
  method of Lucene, I just want to have this filed to do the sorting in
 some
  cases.
  Best regards.
 
  On Mon, Jan 12, 2015 at 10:26 PM, Markus Jelsma 
  markus.jel...@openindex.io
  wrote:
 
   Hi - You mention having a list with important terms, then using
 payloads
   would be the most straightforward i suppose. You still need a custom
   similarity and custom query parser. Payloads work for us very well.
  
   M
  
  
  
   -Original message-
From:Ahmet Arslan iori...@yahoo.com.INVALID
Sent: Monday 12th January 2015 19:50
To: solr-user@lucene.apache.org
Subject: Re: Extending solr analysis in index time
   
Hi Ali,
   
Reading your example, if you could somehow replace idf component with
   your importance weight,
I think your use case looks like TFIDFSimilarity. Tf component
 remains
   same.
   
   
  
 
 https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
   
I also suggest you ask this in lucene mailing list. Someone familiar
   with similarity package can give insight on this.
   
Ahmet
   
   
   
On Monday, January 12, 2015 6:54 PM, Jack Krupansky 
   jack.krupan...@gmail.com wrote:
Could you clarify what you mean by Lucene reverse index? That's
 not a
term I am familiar with.
   
-- Jack Krupansky
   
   
On Mon, Jan 12, 2015 at 1:01 AM, Ali Nazemian alinazem...@gmail.com
 
   wrote:
   
 Dear Jack,
 Thank you very much.
 Yeah I was thinking of function query for sorting, but I have to
   problems
 in this case, 1) function query do the process at query time which
 I
   dont
 want to. 2) I also want to have the score field for retrieving and
   showing
 to users.

 Dear Alexandre,
 Here is some more explanation about the business behind the
 question:
 I am going to provide a field for each document, lets refer it as
 document_score. I am going to fill this field based on the
   information
 that could be extracted from Lucene reverse index. Assume I have a
   list of
 terms, called important terms and I am going to extract the term
   frequency
 for each of the terms inside this list per each document. To be
  honest
   I
 want to use the term frequency for calculating document_score.
 document_score should be storable since I am going to retrieve
 this
   field
 for each document. I also want to do sorting on document_store in
   case of
 preferred by user.
 I hope I did convey my point.
 Best regards.


 On Mon, Jan 12, 2015 at 12:53 AM, Jack Krupansky 
   jack.krupan...@gmail.com
 
 wrote:

  Won't function queries do the job at query time? You can add or
   multiply
  the tf*idf score by a function of the term frequency of arbitrary
   terms,
  using the tf, mul, and add functions.
 
  See:
 
 https://cwiki.apache.org/confluence/display/solr/Function+Queries
 
  -- Jack Krupansky
 
  On Sun, Jan 11, 2015 at 10:55 AM, Ali Nazemian 
   alinazem...@gmail.com
  wrote:
 
   Dear Jack,
   Hi,
   I think you misunderstood my need. I dont want to change the
   default
   scoring behavior of Lucene (tf-idf) I just want to have another
   field
 to
  do
   sorting for some specific queries (not all the search
 business),
 however
  I
   am aware of Lucene payload.
   Thank you very much.
  
   On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky 
  jack.krupan...@gmail.com
   wrote:
  
You would do that with a custom similarity (scoring) class.
   That's an
expert feature. In fact a SUPER-expert feature.
   
Start by completely familiarizing yourself with how TF*IDF
 similarity
already works:
   
   
  
 

  
 
 http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
   
And to use your custom similarity class in Solr:
   
   
  
 

  
 
 

Highlighting whole phrase

2015-01-13 Thread meena.sri...@mathworks.com
Highlighting does not highlight the whole phrase; instead, each word gets
highlighted.
I tried all the suggestions that were given, with no luck.
These are the special settings I tried for phrase highlighting:
hl.usePhraseHighlighter=true
hl.q=query


http://localhost.mathworks.com:8983/solr/db/select?q=syndrome%3A%22Override+ignored+for+property%22rows=1fl=syndrome_idwt=jsonindent=truehl=truehl.simple.pre=%3Cem%3Ehl.simple.post=%3C%2Fem%3Ehl.usePhraseHighlighter=truehl.q=%22Override+ignored+for+property%22hl.fragsize=1000



This is from my schema.xml
<field name="syndrome" type="text_general" indexed="true" stored="true"/>


Should I add parameters in the indexing stage itself to make this work?

Thanks for your time.

Meena
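
One thing that may be worth checking is whether hl.fl points at the field being searched; a minimal, unencoded form of the request above would be something like the following (hl.fl=syndrome is an assumption here, since the original request does not set it):

q=syndrome:"Override ignored for property"
hl=true
hl.fl=syndrome
hl.usePhraseHighlighter=true
hl.q=syndrome:"Override ignored for property"
hl.fragsize=1000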




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highting-whole-pharse-tp4179219.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers

2015-01-13 Thread Erick Erickson
SolrCloud is intended to work in the rolling restart case...

Index size, segment counts, segment names can (and will)
be different on different replicas of the same shard without
anything being amiss. Commits (hard) happen at different
times across the replicas in a shard. Merging logic kicks in
and may (will eventually in all probability) pick different
segments to merge, with varying numbers of deleted docs
that get purged etc.

The numFound reported on a q=*:*&distrib=false, or looking at the
core in the admin screen for the replicas in question and noting
numDocs should be identical though if
1 you've issued a hard commit with openSearcher=true _or_
 a soft commit.
2 you haven't been indexing or haven't issued a commit
 as in 1 since you started looking.

Best,
Erick

On Tue, Jan 13, 2015 at 4:20 AM, Zisis Tachtsidis zist...@runbox.com wrote:
 Daniel Collins wrote
 Is it important where your leader is?  If you just want to minimize
 leadership changes during rolling re-start, then you could restart in the
 opposite order (S3, S2, S1).  That would give only 1 transition, but the
 end result would be a leader on S2 instead of S1 (not sure if that
 important to you or not).  I know its not a fix, but it might be a
 workaround until the whole leadership moving is done?

 I think that rolling restarting the machines in the opposite order
 (S3,S2,S1) will result in S3 being the leader. It's a valid approach but
 shouldn't I have to revert to the original order (S1,S2,S3) to achieve the
 same result in the following rolling restart? This includes operational
 costs and complexity that I want to avoid.


 Erick Erickson wrote
 Just skimming, but the problem here that I ran into was with the
 listeners. Each _Solr_ instance out there is listening to one of the
 ephemeral nodes (the one in front). So deleting a node does _not_
 change which ephemeral node the associated Solr instance is listening
 to.

 So, for instance, when you delete S2..n-01 and re-add it, S2 is
 still looking at S1n-00 and will continue looking at
 S1...n-00 until S1n-00 is deleted.

 Deleting S2..n-01 will wake up S3 though, which should now be
 looking at S1n-000. Now you have two Solr listeners looking at
 the same ephemeral node. The key is that deleting S2...n-01 does
 _not_ wake up S2, just any solr instance that has a watch on the
 associated ephemeral node.

 Thanks for the info Erick. I wasn't aware of this linked-list listeners
 structure between the zk nodes. Based on what you've said though I've
 changed my implementation a bit and it seems to be working at first glance.
 Of course it's not reliable yet but it looks promising.

 My original attempt
 S1:-n_00 (no code running here)
 S2:-n_04 (code deleting zknode -n_01 and creating
 -n_04)
 S3:-n_03 (code deleting zknode -n_02 and creating
 -n_03)

 has been changed to
 S1:-n_00 (no code running here)
 S2:-n_03 (code deleting zknode -n_01 and creating
 -n_03 using EPHEMERAL_SEQUENTIAL)
 S3:-n_02 (no code running here)

 Once S1 is shutdown S3 becomes leader since it listens to S1 now according
 to what you've said

 The original reason I pursued this minimize-leadership-changes quest was
 that it _could_ lead to data loss in some scenarios. I'm not entirely sure,
 though, and you can correct me on this, but let me explain myself.

 If you have incoming indexing requests during a rolling restart, could there
 be a case, during the current leader's shutdown, where the leader-to-be node
 does not have time to sync with the leader that is shutting down? In that
 case everyone would now sync to the new leader, thus missing some updates.
 I've seen an installation where the replicas' index sizes differed and the
 divergence deteriorated over time.




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SolrCloud-shard-leader-elections-Altering-zookeeper-sequence-numbers-tp4178973p4179147.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: SpellCheck (AutoComplete) Not Working In Distributed Environment

2015-01-13 Thread Charles Sanders
Still not able to get my autoComplete component to work in a distributed 
environment. Works fine on a non-distributed system. Also, on the distributed 
system, if I include distrib=false, it works. 

I have tried shards.qt and shards parameters, but they make no difference. I 
should add, I am running SolrCloud and ZooKeeper, if that makes any difference. 
I have played around with this quite a bit, but nothing seems to work. 

When I add shards.qt=/ac {the name of the request handler}, I get an error in 
the solr logs. It simply states: java.lang.NullPointerException. That's it 
nothing more. This is listed as logger SolrCore and SolrDispatchFilter. 

Any ideas, suggestions on how I can troubleshoot and find the problem? Is there 
something specific I should look for? 

Please find attached a text file with the relevant information from schema.xml and 
solrconfig.xml. 

Any help greatly appreciated! Thanks, 
-Charles 
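For anyone following along, a sketch of what the explicit distributed parameters
from the linked wiki page look like in SolrJ (hosts are placeholders; /ac is the
handler from this thread) - whether this helps in a SolrCloud setup is exactly
what is in question here:

SolrQuery q = new SolrQuery("some text");
q.setRequestHandler("/ac");
// point the per-shard requests at the same handler; otherwise the shards
// are queried through a handler that has no spellcheck component
q.set("shards.qt", "/ac");
q.set("shards", "host1:8983/solr/collection1,host2:8983/solr/collection1");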



- Original Message -

From: Erick Erickson erickerick...@gmail.com 
To: solr-user@lucene.apache.org 
Sent: Tuesday, December 30, 2014 6:07:13 PM 
Subject: Re: SpellCheck (AutoComplete) Not Working In Distributed Environment 

Did you try the shards parameter? See: 
https://cwiki.apache.org/confluence/display/solr/Spell+Checking#SpellChecking-DistributedSpellCheck
 

On Tue, Dec 30, 2014 at 2:20 PM, Charles Sanders csand...@redhat.com wrote: 
 I'm running Solr 4.8 in a distributed environment (2 shards). I have added 
 the spellcheck component to my request handler. In my test system, which is 
 not distributed, it works. But when I move it to the Dev box, which is 
 distributed, 2 shards, it is not working. Is there something additional I 
 must do to get this to work in a distributed environment? 
 
 requestHandler default=true name=standard class=solr.SearchHandler 
 !-- default values for query parameters can be specified, these 
 will be overridden by parameters in the request 
 -- 
 lst name=defaults 
 str name=echoParamsexplicit/str 
 int name=rows10/int 
 str name=dfallText/str 
 !-- default autocomplete settings for this search request handler -- 
 str name=spellchecktrue/str 
 str name=spellcheck.dictionaryandreasAutoComplete/str 
 str name=spellcheck.onlyMorePopulartrue/str 
 str name=spellcheck.count5/str 
 str name=spellcheck.collatetrue/str 
 str name=spellcheck.maxCollations5/str 
 /lst 
 arr name=last-components 
 strautoComplete/str 
 /arr 
 /requestHandler 
 
 searchComponent name=autoComplete class=solr.SpellCheckComponent 
 lst name=spellchecker 
 str name=nameandreasAutoComplete/str 
 str name=classnameorg.apache.solr.spelling.suggest.Suggester/str 
 str 
 name=lookupImplorg.apache.solr.spelling.suggest.tst.TSTLookupFactory/str 
 str name=fieldsugg_allText/str 
 str name=buildOnCommittrue/str 
 float name=threshold.005/float 
 str name=queryAnalyzerFieldTypetext_suggest/str 
 /lst 
 /searchComponent 
 
 
 Any help greatly appreciated! Thanks, 
 -Charles 
 
 
 

* Schema.xml ***
field name=issue_suggest type=text_suggest indexed=true stored=false/
field name=sugg_allText type=text_suggest indexed=true 
multiValued=true stored=false/

fieldType name=text_suggest class=solr.TextField 
positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.LowerCaseFilterFactory/
  /analyzer
/fieldType


 Solrconfig.xml ***

!-- Auto-Complete component --
searchComponent name=autoComplete class=solr.SpellCheckComponent
lst name=spellchecker
str name=nameandreasAutoComplete/str
str 
name=classnameorg.apache.solr.spelling.suggest.Suggester/str
str 
name=lookupImplorg.apache.solr.spelling.suggest.tst.TSTLookupFactory/str   
   
str name=fieldsugg_allText/str
str name=buildOnCommittrue/str
float name=threshold.005/float
str name=queryAnalyzerFieldTypetext_suggest/str
/lst
lst name=spellchecker
str name=namerecommendationsAutoComplete/str
str 
name=classnameorg.apache.solr.spelling.suggest.Suggester/str
str 
name=lookupImplorg.apache.solr.spelling.suggest.tst.TSTLookupFactory/str   
   
str name=fieldissue_suggest/str
str name=buildOnCommittrue/str
float name=threshold.005/float
str name=queryAnalyzerFieldTypetext_suggest/str
/lst
/searchComponent

requestHandler name=/ac class=solr.SearchHandler
lst name=defaults
str name=spellchecktrue/str

Re: Solr limiting number of rows to indexed to 21500 every time.

2015-01-13 Thread Michael Della Bitta
Looks like you have an underlying JDBC problem. The socket representing
your database connection seems to be going away. Have you tried running
this query outside of Solr and iterating through all the results? How about
in a standalone Java program? Do you have a DBA you can consult to see if
there are any errors on the Oracle side?
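A bare-bones sketch of that standalone test, reusing the query from the stack
trace (connection details are placeholders):

import java.sql.*;

public class FetchAll {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT col1,col2,col3,col4,XMLSERIALIZE(col5 AS CLOB) AS col5 FROM tableName");
    int rows = 0;
    while (rs.next()) {
      rows++;  // just iterate; the point is whether it survives past ~21500 rows
    }
    System.out.println("fetched " + rows + " rows");
    rs.close(); stmt.close(); conn.close();
  }
}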

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/

On Tue, Jan 13, 2015 at 2:31 AM, Pankaj Sonawane pankaj4sonaw...@gmail.com
wrote:

 Hi,

 I am using the Solr DataImportHandler to index data from a database
 table (Oracle). One of the columns contains a String representation of XML
 (sample below).

 <options>
   <option name="A">1</option>
   <option name="B">2</option>
   <option name="C">3</option>
   .
   .
   .
 </options>  <!-- option can be 100-200 -->

 I want Solr to index each 'name' in the 'option' tags against its value

 ex. JSON for 1 row
 docs: [ {
 COL1: F,
 COL2: ASDF, COL3: ATCC, COL4: 29039757, A_s: 1, B_s: 2, 
 C_s: 3,
 .
 .
 .
 *  }*
 // appending '_s' to 'name' attribute for making dynamic fields.


 But while indexing data, *every time only 21500 rows get indexed*. After
 that many records are indexed, I get the following exception:

 *1320927 [Thread-15] ERROR
 org.apache.solr.handler.dataimport.EntityProcessorBase  û getNext() failed
 for query 'SELECT col1,col2,col3,col4,XMLSERIALIZE(col5 AS  CLOB) AS col5
 FROM
 tableName':org.apache.solr.handler.dataimport.DataImportHandlerException:
 java.sql.SQLRecoverableException: No more data to read from socket*
 *at

 org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:63)*
 *at

 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:378)*
 *at

 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:258)*
 *at

 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:293)*
 *at

 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:116)*
 *at

 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)*
 *at

 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)*
 *at

 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)*
 *at

 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)*
 *at

 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)*
 *at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)*
 *at

 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)*
 *at

 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)*
 *at

 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)*
 *Caused by: java.sql.SQLRecoverableException: No more data to read from
 socket*
 *at
 oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1200)*
 *at
 oracle.jdbc.driver.T4CMAREngine.unmarshalCLR(T4CMAREngine.java:1865)*
 *at
 oracle.jdbc.driver.T4CMAREngine.unmarshalCLR(T4CMAREngine.java:1757)*
 *at
 oracle.jdbc.driver.T4CMAREngine.unmarshalCLR(T4CMAREngine.java:1750)*
 *at

 oracle.jdbc.driver.T4CClobAccessor.handlePrefetch(T4CClobAccessor.java:543)*
 *at

 oracle.jdbc.driver.T4CClobAccessor.unmarshalOneRow(T4CClobAccessor.java:197)*
 *at oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:916)*
 *at oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:835)*
 *at oracle.jdbc.driver.T4C8Oall.readRXD(T4C8Oall.java:664)*
 *at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:328)*
 *at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:186)*
 *at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:521)*
 *at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:194)*
 *at oracle.jdbc.driver.T4CStatement.fetch(T4CStatement.java:1074)*
 *at

 oracle.jdbc.driver.OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:369)*
 *at
 oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:273)*
 *at

 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:370)*
 *... 12 more*

 *1320928 [Thread-15] ERROR org.apache.solr.handler.dataimport.DocBuilder  û
 Exception while processing: e1 document : SolrInputDocument(fields:
 

Re: Solr grouping problem - need help

2015-01-13 Thread Erick Erickson
Something is very wrong here. Have you perhaps been changing your
schema without re-indexing? And I recommend you completely remove
your data directory (the one with index and tlog subdirectories) after
you change your schema.xml file.

Because you're trying to group on a field that is _not_ indexed, you
should be getting an error returned, something like:
can not use FieldCache on a field which is neither indexed nor has
doc values: 

As far as the tokenization comment goes, just start by making the field you want
to group on be
stored=false indexed=true type=string

Best,
Erick
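As a sketch, once tenant_pool is an indexed string field and the data has been
reindexed, the grouped query from the original post would read back like this
(SolrJ 4.x accessors; 'server' is whatever SolrServer the client already uses):

SolrQuery q = new SolrQuery("*:*");
q.set(GroupParams.GROUP, true);
q.set(GroupParams.GROUP_FIELD, "tenant_pool");
QueryResponse rsp = server.query(q);
for (GroupCommand cmd : rsp.getGroupResponse().getValues()) {
  for (Group g : cmd.getValues()) {
    // with a string field the group value is the whole, untokenized value
    System.out.println(g.getGroupValue() + " docs=" + g.getResult().getNumFound());
  }
}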

On Tue, Jan 13, 2015 at 5:09 AM, Naresh Yadav nyadav@gmail.com wrote:
 Hi jack,

 Thanks for replying, i am new to solr please guide me on this. I have many
 such columns in my schema
 so copy field will create lot of duplicate fields beside i do not need any
 search on original field.

 My usecase is i do not want any search on tenant_pool field thats why i
 declared it as stored field not indexed.
 I just need to get unique values in this field. Please show some direction.


 On Tue, Jan 13, 2015 at 6:16 PM, Jack Krupansky jack.krupan...@gmail.com
 wrote:

 That's your job. The easiest way is to do a copyField to a string field.

 -- Jack Krupansky

 On Tue, Jan 13, 2015 at 7:33 AM, Naresh Yadav nyadav@gmail.com
 wrote:

  *Schema :*
  field name=tenant_pool type=text stored=true/
 
  *Code :*
  SolrQuery q = new SolrQuery().setQuery(*:*);
  q.set(GroupParams.GROUP, true);
  q.set(GroupParams.GROUP_FIELD, tenant_pool);
 
  *Data :*
  tenant_pool : Baroda Farms
  tenant_pool : Ketty Farms
 
  *Output coming :*
  groupValue=Farms, docs=2
 
  *Expected Output :*
  groupValue=Baroda Farms, docs=1
  groupValue=Ketty Farms, docs=1
 
  Please guide me how i can tell solr not to tokenize stored field to
 decide
  unique groups..
 
  I want unique groups as exact value of field not the tokens which solr is
  doing
  currently.
 
  Thanks
  Naresh
 




 --
 Cheers,

 Naresh Yadav
 +919960523401
 http://nareshyadav.blogspot.com/
 SSE, MetrixLine Inc.


Re: Tokenizer or Filter ?

2015-01-13 Thread Jack Krupansky
Would it be sufficient for your user case to simply extract all the d1
into one field and all the d2 in another field? If so, the update
processor script would be very simple, simply matching all d1.*/d1
and copying them to a separate field value and same for d2.

If you want examples of script update processors, see my Solr e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

On Tue, Jan 13, 2015 at 9:21 AM, tomas.kalas kala...@email.cz wrote:

 Thanks Jack for your advice. Can you please explain me little more, how it
 works? From Apache Wiki it's not to clear for me. I can write some
 javaScript code when i want filtering some data ? In this case i have
 d1bla bla bla/d1 d2 bla bla bla /d2 d1bla bla bla /d1 and i
 want
 filtering d2 bla bla bla /d2, But in other case i want filtering all
 d1  /d1 then i suppose i used it at indexed data and filtering from
 them? Thanks



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179173.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: leader split-brain at least once a day - need help

2015-01-13 Thread Shawn Heisey
On 1/12/2015 5:34 AM, Thomas Lamy wrote:
 I found no big/unusual GC pauses in the Log (at least manually; I
 found no free solution to analyze them that worked out of the box on a
 headless debian wheezy box). Eventually i tried with -Xmx8G (was 64G
 before) on one of the nodes, after checking allocation after 1 hour
 run time was at about 2-3GB. That didn't move the time frame where a
 restart was needed, so I don't think Solr's JVM GC is the problem.
 We're trying to get all of our node's logs (zookeeper and solr) into
 Splunk now, just to get a better sorted view of what's going on in the
 cloud once a problem occurs. We're also enabling GC logging for
 zookeeper; maybe we were missing problems there while focussing on
 solr logs.

If you make a copy of the gc log, you can put it on another system with
a GUI and graph it with this:

http://sourceforge.net/projects/gcviewer

Just double-click on the jar to run the program.  I find it is useful
for clarity on the graph to go to the View menu and uncheck everything
except the two GC Times options.  You can also change the zoom to a
lower percentage so you can see more of the graph.

That program is how I got the graph you can see on my wiki page about GC
tuning:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Another possible problem is that your install is exhausting the thread
pool.  Tomcat defaults to a maxThreads value of only 200.  There's a
good chance that your setup will need more than 200 threads at least
occasionally.  If you're near the limit, having a thread problem once
per day based on index activity seems like a good possibility.  Try
setting maxThreads to a much higher value in the Tomcat config.

Thanks,
Shawn



Re: Solr grouping problem - need help

2015-01-13 Thread Naresh Yadav
Erick, my schema is the same, no change in that.
*Schema :*
field name=tenant_pool type=text stored=true/
My guess is that I had not mentioned indexed=true or indexed=false, so maybe
the default (indexed=true) applies.

My question is: for an indexed=false, stored=true field, what is the optimized
way to get the unique values in such a field?

On Tue, Jan 13, 2015 at 10:07 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Something is very wrong here. Have you perhaps been changing your
 schema without re-indexing? And I recommend you completely remove
 your data directory (the one with index and tlog subdirectories) after
 you change your schema.xml file.

 Because you're trying to group on a field that is _not_ indexed, you
 should be getting an error returned, something like:
 can not use FieldCache on a field which is neither indexed nor has
 doc values: 

 As far as the tokenization comment is, just start by making the field you
 want
 to group on be
 stored=false indexed=true type=string

 Best,
 Erick

 On Tue, Jan 13, 2015 at 5:09 AM, Naresh Yadav nyadav@gmail.com
 wrote:
  Hi jack,
 
  Thanks for replying, i am new to solr please guide me on this. I have
 many
  such columns in my schema
  so copy field will create lot of duplicate fields beside i do not need
 any
  search on original field.
 
  My usecase is i do not want any search on tenant_pool field thats why i
  declared it as stored field not indexed.
  I just need to get unique values in this field. Please show some
 direction.
 
 
  On Tue, Jan 13, 2015 at 6:16 PM, Jack Krupansky 
 jack.krupan...@gmail.com
  wrote:
 
  That's your job. The easiest way is to do a copyField to a string
 field.
 
  -- Jack Krupansky
 
  On Tue, Jan 13, 2015 at 7:33 AM, Naresh Yadav nyadav@gmail.com
  wrote:
 
   *Schema :*
   field name=tenant_pool type=text stored=true/
  
   *Code :*
   SolrQuery q = new SolrQuery().setQuery(*:*);
   q.set(GroupParams.GROUP, true);
   q.set(GroupParams.GROUP_FIELD, tenant_pool);
  
   *Data :*
   tenant_pool : Baroda Farms
   tenant_pool : Ketty Farms
  
   *Output coming :*
   groupValue=Farms, docs=2
  
   *Expected Output :*
   groupValue=Baroda Farms, docs=1
   groupValue=Ketty Farms, docs=1
  
   Please guide me how i can tell solr not to tokenize stored field to
  decide
   unique groups..
  
   I want unique groups as exact value of field not the tokens which
 solr is
   doing
   currently.
  
   Thanks
   Naresh
  
 
 
 
 
  --
  Cheers,
 
  Naresh Yadav
  +919960523401
  http://nareshyadav.blogspot.com/
  SSE, MetrixLine Inc.



Re: Solr large boolean filter

2015-01-13 Thread Alexandre Rafalovitch
TermsQueryParser I think is somewhat new. Have you tried that one?
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
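For reference, it takes a comma-separated list in a single filter query, along
these lines (field name and ids are made up):

SolrQuery q = new SolrQuery("*:*");
// one terms lookup against the id field instead of a huge boolean OR chain
q.addFilterQuery("{!terms f=id}12,37,4096,81920");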

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 13 January 2015 at 12:54, rashmy1 rashm...@gmail.com wrote:
 Hello,
 We have a similar requirement where a large list of IDs needs to be sent to
 SOLR in filter query.
 Could someone please help understand if this feature is now supported in the
 new versions of SOLR?

 Thanks



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-tp4070747p4179276.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Engage custom hit collector for special search processing

2015-01-13 Thread tedsolr
As insane as it sounds, I need to process all the results. No one document is
more or less important than another. Only a few hundred unique docs will
be sent to the client at any one time, but the users expect to page through
them all.

I don't expect sub-second performance for this task. I'm just hoping for
something reasonable, and I can't define that either.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Engage-custom-hit-collector-for-special-search-processing-tp4179348p4179366.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr unit tests intermittently fail with error: java.lang.NoClassDefFoundError: org/eclipse/jetty/util/security/CertificateUtils

2015-01-13 Thread Shawn Heisey
On 1/13/2015 2:50 PM, brian4 wrote:
 The problem is the jetty-util version included in the Solr build is 6.1.26,
 but this particular package is from version 7+.  Looks like it is a bug in
 the build files for Solr.

 I fixed it by downloading jetty 7 separately and manually adding
 jetty-util-7.6.16.v20140903.jar to the end of my classpath.

The jetty version included in the Solr build for 4.x is
8.1.10.v20130312.  There is a dependency in *Lucene* for 6.1.26, but
it's a completely optional Lucene add-on that is not used in *Solr*.

Thanks,
Shawn



Engage custom hit collector for special search processing

2015-01-13 Thread tedsolr
I have a complicated problem to solve, and I don't know enough about
lucene/solr to phrase the question properly. This is kind of a shot in the
dark. My requirement is to return search results always in completely
collapsed form, rolling up duplicates with a count. Duplicates are defined
by whatever fields are requested. If the search requests fields A, B, C,
then all matched documents that have identical values for those 3 fields are
dupes. The field list may change with every new search request. What I do
know is the super set of all fields that may be part of the field list at
index time.

I know this can't be done with configuration alone. It doesn't seem
performant to retrieve all 1M+ docs and post process in Java. A very smart
person told me that a custom hit collector should be able to do the
filtering for me. So, maybe I create a custom search handler that somehow
exposes this custom hit collector that can use FieldCache or DocValues to
examine all the matches and filter the results in the way I've described
above.

So assuming this is a viable solution path, can anyone suggest some helpful
posts, code fragments, books for me to review? I admit to being out of my
depth, but this requirement isn't going away. I'm grasping for straws right
now.

thanks
(using Solr 4.9)





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Engage-custom-hit-collector-for-special-search-processing-tp4179348.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow faceting performance on a docValues field

2015-01-13 Thread David Smith
Shawn,
I've been thinking along your lines, and continued to run tests through the 
day.  The results surprised me.

For my index, Solr range faceting time is most closely related to the total 
number of documents in the index for the range specified.  The number of 
buckets in the range is a second factor.   

I found NO correlation whatsoever to the number of hits in the query.  Whether 
I have 3 hits or 1,500,000 hits, it's ~24 seconds to facet the result for that 
same time period.  That is what surprised me.

For example, if my facet range is a 10 year period for which there exists 47M 
docs in the index, the facet time is 24 seconds.  If I switch my facet range to 
a different 10 year period with 1.3M docs, the facet time drops to less than 5 
seconds.  

If I go back to my original 10 year period (with 47M docs in the index), but 
facet by month instead of day, my facet time drops to 2.5 seconds.  Now, I 
can't meet my user needs this way, but it does show the relationship between # 
of buckets and faceting time.

Regards,

David

Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Chris Hostetter

: ...the nuts  bolts of it is that the PostingFormat baseclass should take 
: care of all the SPI name registration that you need based on what you 
: pass to the super() construction ... allthough now that i think about it, 
: i'm not sure how you'd go about specifying your own name for the 
: PostingFormat when also doing something like subclassing 
: Lucene41PostingsFormat ... there's no Lucene41PostingsFormat constructor 
: you can call from your subclass to override the name.
: 
: not sure what the expectation is there in the java API.

ok, so i talked this through with mikemccand on IRC...

in 4x, the API is actually really dangerous - you can subclass things like 
Lucene41PostingsFormat w/o overriding the name used in SPI, and might 
really screw things up as far as what class is used to read back your 
files later.

in the 5.0 APIs, these non-abstract codec related classes are all final to 
prevent exactly this type of behavior - but you can still use the 
constructor args to change behavior related to *writing* the index, and 
the classes all are designed to be smart enough that when they are loaded 
by SPI at search time, they can make sense of what's on disk (regardless of 
whether non-default constructor args were used at index time)

but the question remains: where does that leave you as a solr user who 
wants to write a plugin, since Solr only allows you to configure the SPI 
name (no constructor args) via 'postingFormat=foo'

the answer is that instead of writing a subclass, you would have to write 
a small proxy class, something like...


public final class MyPfWrapper extends PostingsFormat {
  // delegate to the stock format, constructed with custom min/max block sizes
  PostingsFormat pf = new Lucene50PostingsFormat(42, 9);
  public MyPfWrapper() {
    super("MyPfWrapper");
  }
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws 
IOException {
    return pf.fieldsConsumer(state);
  }
  public FieldsProducer fieldsProducer(SegmentReadState state) throws 
IOException {
    return pf.fieldsProducer(state);
  }
} 

..and then refer to it with postingFormat=MyPfWrapper

at index time, Solr will use SPI to find your MyPfWrapper class, which 
will delegate to an instance of Lucene50PostingsFormat constructed with 
the overriden constants, and then at query time the SegmentReader code 
paths will use SPI to find MyPfWrapper by name as well, and it will again 
delegate to Lucene50PostingsFormat for reading back the index.


or at least: that's how it *should* work :)




-Hoss
http://www.lucidworks.com/


Re: Slow faceting performance on a docValues field

2015-01-13 Thread Shawn Heisey
On 1/13/2015 11:44 AM, David Smith wrote:
 I looked at Interval faceting.  My required interval is 1 day.  I cannot 
 change that requirement.  Unless I am mis-reading the doc, that means to 
 facet a 10 year range, the query needs to specify over 3,600 intervals ??

I am very ignorant of how the internals work ... but it sounds like the
parameters you have chosen are basically making thousands of separate
facets, almost all of which will ultimately return zero, and therefore
be excluded from the results.

If my naive assessment of the situation is even close to accurate, then
I think the rest of this paragraph would apply:  If we assume that those
individual facets are running consecutively, each one would be
completing in single-digit-millisecond time to add up to about 25
seconds.  If we assume they are running in parallel, that's a LOT of
work to handle all at once, and the actual workload might look more like
it's consecutive because there aren't enough CPU resources to handle
them truly in parallel.  I don't know that thousands of facets can be
sped up very much.

Thanks,
Shawn



Re: Engage custom hit collector for special search processing

2015-01-13 Thread Jack Krupansky
Do you have a sense of what your typical queries would look like? I mean,
maybe you wouldn't actually need to fetch more than a tiny fraction of
those million documents. Do you only need to determine the top 10 or 20 or
50 unique field value row sets, or do you need to determine ALL unique row
sets? The latter would never be very performant even as a custom
handler/collector since it would have to scan all rows.

Try a client-side solution that reads 100 (or 50 or 20 or 200) rows at a
time, storing rows by the unique combination of field values, until you hit
the threshold needed for number of unique row sets.
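A rough sketch of that client-side roll-up in SolrJ (field names, page size and
the stop threshold are arbitrary; 'server' is the existing SolrServer):

String[] fields = { "A", "B", "C" };
Map<List<String>, Integer> counts = new LinkedHashMap<List<String>, Integer>();
int start = 0, pageSize = 100;
while (true) {
  SolrQuery q = new SolrQuery("your query");
  q.setFields(fields);
  q.setStart(start);
  q.setRows(pageSize);
  SolrDocumentList page = server.query(q).getResults();
  for (SolrDocument doc : page) {
    List<String> key = new ArrayList<String>();
    for (String f : fields) key.add(String.valueOf(doc.getFieldValue(f)));
    Integer c = counts.get(key);
    counts.put(key, c == null ? 1 : c + 1);   // count duplicates per unique row set
  }
  start += page.size();
  if (page.size() < pageSize || counts.size() >= 500) break;  // enough unique rows
}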

-- Jack Krupansky

On Tue, Jan 13, 2015 at 4:29 PM, tedsolr tsm...@sciquest.com wrote:

 I have a complicated problem to solve, and I don't know enough about
 lucene/solr to phrase the question properly. This is kind of a shot in the
 dark. My requirement is to return search results always in completely
 collapsed form, rolling up duplicates with a count. Duplicates are
 defined
 by whatever fields are requested. If the search requests fields A, B, C,
 then all matched documents that have identical values for those 3 fields
 are
 dupes. The field list may change with every new search request. What I do
 know is the super set of all fields that may be part of the field list at
 index time.

 I know this can't be done with configuration alone. It doesn't seem
 performant to retrieve all 1M+ docs and post process in Java. A very smart
 person told me that a custom hit collector should be able to do the
 filtering for me. So, maybe I create a custom search handler that somehow
 exposes this custom hit collector that can use FieldCache or DocValues to
 examine all the matches and filter the results in the way I've described
 above.

 So assuming this is a viable solution path, can anyone suggest some helpful
 posts, code fragments, books for me to review? I admit to being out of my
 depth, but this requirement isn't going away. I'm grasping for straws right
 now.

 thanks
 (using Solr 4.9)





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Engage-custom-hit-collector-for-special-search-processing-tp4179348.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Engage custom hit collector for special search processing

2015-01-13 Thread Alexandre Rafalovitch
Sounds like:
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
http://heliosearch.org/the-collapsingqparserplugin-solrs-new-high-performance-field-collapsing-postfilter/

The main issue is your multi-field criteria. So you may need to
extend/overwrite the comparison method. Plus you'd need to keep the
counts. Which you should know since you are doing the filtering.
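For the single-field case the plugin handles out of the box, the request is just
a post-filter plus the expand component, roughly like this (the groupKey field is
hypothetical - for the multi-field case you would index a combined key or extend
the plugin as noted above):

SolrQuery q = new SolrQuery("your query");
// keep one representative document per distinct value of groupKey
q.addFilterQuery("{!collapse field=groupKey}");
// bring back the collapsed duplicates per group so they can be counted
q.set("expand", "true");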

Is this the right direction for what you need?

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 13 January 2015 at 16:29, tedsolr tsm...@sciquest.com wrote:
 I have a complicated problem to solve, and I don't know enough about
 lucene/solr to phrase the question properly. This is kind of a shot in the
 dark. My requirement is to return search results always in completely
 collapsed form, rolling up duplicates with a count. Duplicates are defined
 by whatever fields are requested. If the search requests fields A, B, C,
 then all matched documents that have identical values for those 3 fields are
 dupes. The field list may change with every new search request. What I do
 know is the super set of all fields that may be part of the field list at
 index time.

 I know this can't be done with configuration alone. It doesn't seem
 performant to retrieve all 1M+ docs and post process in Java. A very smart
 person told me that a custom hit collector should be able to do the
 filtering for me. So, maybe I create a custom search handler that somehow
 exposes this custom hit collector that can use FieldCache or DocValues to
 examine all the matches and filter the results in the way I've described
 above.

 So assuming this is a viable solution path, can anyone suggest some helpful
 posts, code fragments, books for me to review? I admit to being out of my
 depth, but this requirement isn't going away. I'm grasping for straws right
 now.

 thanks
 (using Solr 4.9)





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Engage-custom-hit-collector-for-special-search-processing-tp4179348.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr unit tests intermittently fail with error: java.lang.NoClassDefFoundError: org/eclipse/jetty/util/security/CertificateUtils

2015-01-13 Thread brian4
The problem is the jetty-util version included in the Solr build is 6.1.26,
but this particular package is from version 7+.  Looks like it is a bug in
the build files for Solr.

I fixed it by downloading jetty 7 separately and manually adding
jetty-util-7.6.16.v20140903.jar to the end of my classpath.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-unit-tests-intermittently-fail-with-error-java-lang-NoClassDefFoundError-org-eclipse-jetty-utils-tp4175652p4179356.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Hoss,

This is starting to sound pretty complicated. Are you saying this is not
doable with Solr 4.10?
...or at least: that's how it *should* work :)   makes me a bit nervous
about trying this on my own.

Should I open a JIRA issue or am I probably the only person with a use case
for replacing a TermIndexInterval setting with changing the min and max
block size on the 41 postings format?



Tom








On Tue, Jan 13, 2015 at 3:16 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : ...the nuts  bolts of it is that the PostingFormat baseclass should take
 : care of all the SPI name registration that you need based on what you
 : pass to the super() construction ... allthough now that i think about it,
 : i'm not sure how you'd go about specifying your own name for the
 : PostingFormat when also doing something like subclassing
 : Lucene41PostingsFormat ... there's no Lucene41PostingsFormat constructor
 : you can call from your subclass to override the name.
 :
 : not sure what the expectation is there in the java API.

 ok, so i talked this through with mikemccand on IRC...

 in 4x, the API is actaully really dangerous - you can subclass things like
 Lucene41PostingsFormat w/o overriding the name used in SPI, and might
 really screw things up as far as what class is used to read back your
 files later.

 in the 5.0 APIs, these non-abstract codec related classes are all final to
 prevent exactly this type of behavior - but you can still use the
 constructor args to change behavior related to *writing* the index, and
 the classes all are designed to be smart enough that when they are loaded
 by SPI at search time, they can make sense of what's on disk (regardlessof
 wether non-default constructor args were used at index time)

 but the question remains: where does that leave you as a solr user who
 wants to write a plugin, since Solr only allows you to configure the SPI
 name (no constructor args) via 'postingFormat=foo'

 the anwser is that instead of writing a subclass, you would have to write
 a small proxy class, something like...


 public final class MyPfWrapper extends PostingFormat {
   PostingFormat pf = new Lucene50PostingsFormat(42, 9);
   public MyPfWrapper() {
 super(MyPfWrapper);
   }
   public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws
 IOException {
 return pf.fieldsConsumer(state);
   }
   public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws
 IOException {
 return pf.fieldsConsumer(state);
   }
   public FieldsProducer fieldsProducer(SegmentReadState state) throws
 IOException {
 return pf.fieldsProducer(state);
   }
 }

 ..and then refer to it with postingFormat=MyPfWrapper

 at index time, Solr will use SPI to find your MyPfWrapper class, which
 will delegate to an instance of Lucene50PostingsFormat constructed with
 the overriden constants, and then at query time the SegmentReader code
 paths will use SPI to find MyPfWrapper by name as well, and it will again
 delegate to Lucene50PostingsFormat for reading back the index.


 or at least: that's how it *should* work :)




 -Hoss
 http://www.lucidworks.com/



Re: Engage custom hit collector for special search processing

2015-01-13 Thread Joel Bernstein
You may also want to take a look at how AnalyticsQueries can be plugged in.
This won't show you how to do the implementation, but it will show you how
you can plug in a custom collector.

http://heliosearch.org/solrs-new-analyticsquery-api/
http://heliosearch.org/solrs-mergestrategy/
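A very small sketch of the shape of such a plugin, going from the first post
above (treat the method signatures as my recollection of the 4.x API rather than
gospel, and wiring it in still needs a QParserPlugin as described there):

public class CountingQuery extends AnalyticsQuery {
  public DelegatingCollector getAnalyticsCollector(final ResponseBuilder rb,
                                                   IndexSearcher searcher) {
    return new DelegatingCollector() {
      private int count;
      public void collect(int doc) throws IOException {
        count++;              // inspect the doc via DocValues/FieldCache here
        super.collect(doc);   // let normal result collection continue
      }
      public void finish() throws IOException {
        rb.rsp.add("matchCount", count);   // attach whatever was computed
        super.finish();
      }
    };
  }
}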

Joel Bernstein
Search Engineer at Heliosearch

On Tue, Jan 13, 2015 at 4:45 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 Sounds like:

 https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

 http://heliosearch.org/the-collapsingqparserplugin-solrs-new-high-performance-field-collapsing-postfilter/

 The main issue is your multi-field criteria. So you may need to
 extend/overwrite the comparison method. Plus you'd need to keep the
 counts. Which you should know since you are doing the filtering.

 Is this the right direction for what you need?

 Regards,
Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/


 On 13 January 2015 at 16:29, tedsolr tsm...@sciquest.com wrote:
  I have a complicated problem to solve, and I don't know enough about
  lucene/solr to phrase the question properly. This is kind of a shot in
 the
  dark. My requirement is to return search results always in completely
  collapsed form, rolling up duplicates with a count. Duplicates are
 defined
  by whatever fields are requested. If the search requests fields A, B, C,
  then all matched documents that have identical values for those 3 fields
 are
  dupes. The field list may change with every new search request. What I
 do
  know is the super set of all fields that may be part of the field list at
  index time.
 
  I know this can't be done with configuration alone. It doesn't seem
  performant to retrieve all 1M+ docs and post process in Java. A very
 smart
  person told me that a custom hit collector should be able to do the
  filtering for me. So, maybe I create a custom search handler that somehow
  exposes this custom hit collector that can use FieldCache or DocValues to
  examine all the matches and filter the results in the way I've described
  above.
 
  So assuming this is a viable solution path, can anyone suggest some
 helpful
  posts, code fragments, books for me to review? I admit to being out of my
  depth, but this requirement isn't going away. I'm grasping for straws
 right
  now.
 
  thanks
  (using Solr 4.9)
 
 
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Engage-custom-hit-collector-for-special-search-processing-tp4179348.html
  Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Chris Hostetter

: This is starting to sound pretty complicated. Are you saying this is not
: doable with Solr 4.10?

it should be doable in 4.10, using a wrapper class like the one i 
mentioned below (delegating to Lucene41PostingsFormat instead of 
Lucene50PostingsFormat) ... it's just that the 4.10 APIs are dangerous and 
let malicious/foolish java devs do scary things they shouldn't do.  but 
what i outlined before (Below) is intended to work, and should continue to 
work in 5.x.

: ...or at least: that's how it *should* work :)   makes me a bit nervous
: about trying this on my own.

...worst case scenario, i overlooked something - but all it would take to 
verify that it's working is to try it at small scale: write the class, 
configure it, index a handful of docs, shut down & restart solr, and see if 
your index opens & is correctly searchable -- if it is, then i didn't 
overlook anything, if it isn't then there is a bug somewhere and details 
of your experiment with your custom posting format (ie wrapper class) 
source in JIRA would be helpful.

: Should I open a JIRA issue or am I probably the only person with a use case
: for replacing a TermIndexInterval setting with changing the min and max
: block size on the 41 postings format?

you're the only person i've ever seen ask about it :)


:  public final class MyPfWrapper extends PostingFormat {
:PostingFormat pf = new Lucene50PostingsFormat(42, 9);
:public MyPfWrapper() {
:  super(MyPfWrapper);
:}
:public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws
:  IOException {
:  return pf.fieldsConsumer(state);
:}
:public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws
:  IOException {
:  return pf.fieldsConsumer(state);
:}
:public FieldsProducer fieldsProducer(SegmentReadState state) throws
:  IOException {
:  return pf.fieldsProducer(state);
:}
:  }
: 
:  ..and then refer to it with postingFormat=MyPfWrapper


-Hoss
http://www.lucidworks.com/