Re: A few questions about solr and tika
Everything about Tika extraction is written under those links. Basically what you need is the following:
1) a requestHandler for Tika in solrconfig.xml
2) keep all the fields in schema.xml that are needed for Tika (they are marked in the example schema.xml) and set those you don't need to indexed="false" and stored="false"
3) if you want to limit the fields returned in the query response, use the query parameter 'fl'.

Primoz

From: wonder a-wonde...@rambler.ru
To: solr-user@lucene.apache.org
Date: 17.10.2013 14:44
Subject: Re: A few questions about solr and tika

Thanks for the answer. If I don't want to store and index any fields, do I do:

<field name="links" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="link" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="img" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="iframe" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="area" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="map" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="pragma" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="expires" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="keywords" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="stream_source_info" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->

The other questions are still open for me.
17.10.2013 14:26, primoz.sk...@policija.si wrote:

Why don't you check these:
- Content extraction with Apache Tika (http://www.youtube.com/watch?v=ifgFjAeTOws)
- ExtractingRequestHandler (http://wiki.apache.org/solr/ExtractingRequestHandler)
- Uploading Data with Solr Cell using Apache Tika (https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)

Primož

From: wonder a-wonde...@rambler.ru
To: solr-user@lucene.apache.org
Date: 17.10.2013 12:23
Subject: A few questions about solr and tika

Hello everyone! Please tell me how and where to set Tika options in Solr? Where is the Tika config? I want to know how I can eliminate response attributes I don't need (such as links or images). I am also interested in how I can get and index only the metadata of several file formats.
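For reference, the setup discussed in this thread (a Tika request handler plus schema rules that swallow the Tika attributes you don't want) can be sketched as below. This is a hedged sketch modeled on the stock Solr 4.x example configs, not a drop-in file; the "ignored" field type together with the uprefix parameter is the usual way to drop unwanted Tika metadata without listing every field by hand.

```xml
<!-- solrconfig.xml: Tika extraction handler (sketch, after the Solr 4.x example) -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- map Tika's body text into the schema's main text field -->
    <str name="fmap.content">text</str>
    <!-- any Tika field the schema doesn't declare gets the ignored_ prefix -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

<!-- schema.xml: everything prefixed ignored_ is neither indexed nor stored -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" multiValued="true"/>
<dynamicField name="ignored_*" type="ignored"/>
```

With that in place, point 3 at query time becomes e.g. fl=id,title,text, which keeps links, images and the like out of the response.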
Re: ExtractRequestHandler, skipping errors
Hi,

We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. Following is the error from solr.log. The file type causing this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.

ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:673)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:383)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
    at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:102)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
    ... 16 more

On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

Hi Roland,

(13/10/17 20:44), Roland Everaert wrote:

Hi, I helped a customer to deploy Solr + ManifoldCF and everything is going quite smoothly, but every time Solr raises an exception, the ManifoldCF job feeding Solr aborts. I would like to know if it is possible to configure the ExtractingRequestHandler to ignore errors, as seems to be possible with the DataImportHandler and entity processors. I know that it is possible to configure the ExtractingRequestHandler to ignore Tika exceptions (we already do that), but the errors that now stop the ManifoldCF jobs are generated by Solr itself. While it would be interesting to have such an option in Solr, I plan to post to the ManifoldCF mailing list anyway, to ask whether it is possible to configure ManifoldCF to be less picky about Solr errors.
ignoreTikaException flag might help you? https://issues.apache.org/jira/browse/SOLR-2480

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
Proximity search with wildcard
Hi, I am new to Solr. Is it possible to do a proximity search with a wildcard in Solr? For example: comp* engage~5.

--
View this message in context: http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285.html
Sent from the Solr - User mailing list archive at Nabble.com.
Complex Queries in solr
Hi, Is it possible to run complex queries like (consult* or advis*) NEAR(40) (fee or retainer or salary or bonus) in Solr?

- Sayeed

--
View this message in context: http://lucene.472066.n3.nabble.com/Complex-Queries-in-solr-tp4096288.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: solrconfig.xml carrot2 params
Hi,

Out of curiosity -- what would you like to achieve by changing Tokenizer.documentFields? If you want to have clustering applied to more than one document field, you can provide a comma-separated list of fields in the carrot.title and/or carrot.snippet parameters.

Thanks,

Staszek
--
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com
http://carrotsearch.com

On Thu, Oct 17, 2013 at 11:49 PM, youknow...@heroicefforts.net wrote:

Would someone help me out with the syntax for setting Tokenizer.documentFields in the ClusteringComponent engine definition in solrconfig.xml? Carrot2 is expecting a Collection of Strings. There's no schema definition for this XML file and a big TODO on the wiki wrt init params. Every permutation I have tried results in an error stating: Cannot set java.util.Collection field ... to java.lang.String.

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
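Passing several fields through carrot.title/carrot.snippet, as suggested above, looks roughly like this in solrconfig.xml. A hedged sketch modeled on the Solr 4.x clustering example; the field names title and content are placeholders for whatever your schema actually uses.

```xml
<!-- solrconfig.xml: clustering over multiple fields (sketch) -->
<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
    <!-- cluster on the title plus a comma-separated list of body fields,
         instead of trying to set Tokenizer.documentFields directly -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">title,content</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```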
Re: Proximity search with wildcard
Hi Sayeed, you can use a fuzzy search: comp engage~0.2.

Regards,

Harshvardhan Ojha

On Fri, Oct 18, 2013 at 10:28 AM, sayeed abdulsayeed...@gmail.com wrote: Hi, I am new to Solr. Is it possible to do a proximity search with a wildcard in Solr? For example: comp* engage~5. -- View this message in context: http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285.html Sent from the Solr - User mailing list archive at Nabble.com.
how to retireve content page in solr
Hi, I'm new to Solr. I use Nutch 1.1 to crawl web pages and Solr to index those pages. My problem is: how do I retrieve the content of a document stored in Solr? For example, if I have a page http://www.prova.com/prova.html that contains the text "This is a web page", is there a way to retrieve the text "This is a web page"? Any ideas? My application is written in Java.

Thanks,

Danilo

--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: ExtractRequestHandler, skipping errors
Hi,

I think the flag cannot ignore a NoSuchMethodError. There may be something wrong here... I've just checked my Solr 4.5 directories and found the Tika version is 1.4. Tika 1.4 seems to use Commons Compress 1.5:

http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup

But I see commons-compress-1.4.1.jar in the solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get Commons Compress 1.5 and put it in that directory (don't forget to remove the 1.4.1 jar file).

koji

(13/10/18 16:37), Roland Everaert wrote:

Hi, We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. The file type causing this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.
Re: how to retireve content page in solr
Hi Danilo,

What do you mean by content information? The whole document? Metadata? Do you keep it separate in some fields? Or is this about Solr search queries?

Regards,

Harshvardhan Ojha

On Fri, Oct 18, 2013 at 1:09 PM, javozzo danilo.domen...@gmail.com wrote: Hi, I'm new to Solr. I use Nutch 1.1 to crawl web pages and Solr to index those pages. My problem is: how do I retrieve the content of a document stored in Solr? For example, if I have a page http://www.prova.com/prova.html that contains the text "This is a web page", is there a way to retrieve the text "This is a web page"? Any ideas? My application is written in Java. Thanks Danilo -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302.html Sent from the Solr - User mailing list archive at Nabble.com.
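If the goal is to get the crawled page text itself back from Solr, the field Nutch writes the body into has to be stored. A hedged sketch of the relevant schema.xml line, assuming a Nutch-style schema where the body lands in a field named content (Nutch ships it with stored="false" by default, which is why the text cannot be retrieved):

```xml
<!-- schema.xml: make the crawled page body retrievable (sketch) -->
<field name="content" type="text" indexed="true" stored="true"/>
```

After reindexing with that change, a request such as /select?q=url:"http://www.prova.com/prova.html"&fl=content should return the stored text in the response.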
Re: Debugging update request
@Michael: Yep, that's the bit that's addressed by the two patches I referenced. If you can try this with 4.5 (or the soon to be done 4.5.1), the problem should go away. @Chris: I think you have a different issue. A very quick glance at your stack trace doesn't really show anything outstanding. There are always a bunch of threads waiting around for something to do that show up as blocked. So I'm pretty puzzled. Are your Solr logs showing anything when you try to update after this occurs? On Wed, Oct 16, 2013 at 11:32 AM, Chris Geeringh geeri...@gmail.com wrote: Here is my jstack output... Lots of blocked threads. http://pastebin.com/1ktjBYbf On 16 October 2013 10:28, michael.boom my_sky...@yahoo.com wrote: I got the trace from jstack. I found references to semaphore but not sure if this is what you meant. Here's the trace: http://pastebin.com/15QKAz7U -- View this message in context: http://lucene.472066.n3.nabble.com/Debugging-update-request-tp4095619p4095847.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Concurent indexing
Chris:

OK, one of those stack traces does have the problem I referenced in the other thread. Are you sending updates to the server with SolrJ? And are you using CloudSolrServer? If you are, I'm surprised... Here are the important lines:

- java.util.concurrent.Semaphore.acquire() @bci=5, line=317 (Compiled frame)
- org.apache.solr.util.AdjustableSemaphore.acquire() @bci=4, line=61 (Compiled frame)
- org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.update.SolrCmdDistributor$Request) @bci=22, line=418 (Compiled frame)
- org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.client.solrj.request.UpdateRequest,

On Wed, Oct 16, 2013 at 2:04 PM, Chris Geeringh geeri...@gmail.com wrote: Here's another jstack http://pastebin.com/8JiQc3rb

On 16 October 2013 11:53, Chris Geeringh geeri...@gmail.com wrote: Hi Erick, here is a paste from the other thread (debugging update request) with my input, as I am seeing errors too: I ran an import last night, and this morning my cloud wouldn't accept updates. I'm running the latest 4.6 snapshot. I was importing with the latest SolrJ snapshot, using the javabin transport with CloudSolrServer. The cluster had indexed ~1.3 million docs before no further updates were accepted; querying still works. I'll run jstack shortly and provide the results. Here is my jstack output... Lots of blocked threads. http://pastebin.com/1ktjBYbf

On 16 October 2013 11:46, Erick Erickson erickerick...@gmail.com wrote: Run jstack on the Solr process (standard with Java) and look for the word "semaphore". You should see your servers blocked on this in the Solr code. That'll pretty much nail it. There's an open JIRA to fix the underlying cause, see SOLR-5232, but that's currently slated for 4.6, which won't be cut for a while. Also, there's a patch that will fix this as a side effect, assuming you're using SolrJ: see SOLR-4816, which is available in 4.5.

Best,

Erick

On Tue, Oct 15, 2013 at 1:33 PM, michael.boom my_sky...@yahoo.com wrote: Here are some of Solr's last words (log content before it stopped accepting updates); maybe someone can help me interpret it. http://pastebin.com/mv7fH62H -- View this message in context: http://lucene.472066.n3.nabble.com/Concurent-indexing-tp4095409p4095642.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: measure result set quality
bq: How do you compare the quality of your search results in order to decide which schema is better?

Well, that's actually a hard problem. There's the various TREC data, but that's a generic solution, and most every individual application of this generic thing called search has its own version of "good results". Note that scores are NOT comparable across different queries, even in the same data set, so don't go down that path.

I'd fire the question back at you: can you define what good (or better) results are in such a way that you can program an evaluation? Often the answer is no...

One common technique is to have knowledgeable users do what's called A/B testing. You fire the query at two separate Solr instances and display the results side by side, and the user says "A is more relevant" or "B is more relevant". Kind of like an eye doctor. In sophisticated A/B testing, the program randomly changes which side the results appear on, so you remove sidedness bias.

FWIW,

Erick

On Thu, Oct 17, 2013 at 11:28 AM, Alvaro Cabrerizo topor...@gmail.com wrote: Hi, Imagine the following situation. You have a corpus of documents and a list of queries extracted from a production environment. The corpus hasn't been manually annotated with relevant/non-relevant tags for every query. Then you configure various Solr instances, changing the schema (adding synonyms, stopwords...). After indexing, you prepare and execute the test over the different schema configurations. How do you compare the quality of your search results in order to decide which schema is better? Regards.
XLSB files not indexed
Hi,

Can someone tell me if Tika is supposed to extract data from XLSB files (the new MS Office format in binary form)? If so, then it seems that Solr is not able to index them, just as it is not able to index ODF files (a JIRA is already open for ODF: https://issues.apache.org/jira/browse/SOLR-4809). Can someone confirm the problem, or tell me what to do to make Solr work with XLSB files?

Regards,

Roland.
Re: ExtractRequestHandler, skipping errors
I will open a JIRA issue; I suppose that I just have to create an account first?

Regards,

Roland.

On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi, I think the flag cannot ignore a NoSuchMethodError. There may be something wrong here... I've just checked my Solr 4.5 directories and found the Tika version is 1.4. Tika 1.4 seems to use Commons Compress 1.5: http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup But I see commons-compress-1.4.1.jar in the solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get Commons Compress 1.5 and put it in that directory (don't forget to remove the 1.4.1 jar file). koji

(13/10/18 16:37), Roland Everaert wrote: Hi, We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. The file type causing this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.
Re: ExtractRequestHandler, skipping errors
Here is the link to the issue: https://issues.apache.org/jira/browse/SOLR-5365

Thanks for your help.

Roland Everaert.

On Fri, Oct 18, 2013 at 1:40 PM, Roland Everaert reveatw...@gmail.com wrote: I will open a JIRA issue; I suppose that I just have to create an account first? Regards, Roland.

On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi, I think the flag cannot ignore a NoSuchMethodError. There may be something wrong here... I've just checked my Solr 4.5 directories and found the Tika version is 1.4. Tika 1.4 seems to use Commons Compress 1.5: http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup But I see commons-compress-1.4.1.jar in the solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get Commons Compress 1.5 and put it in that directory (don't forget to remove the 1.4.1 jar file). koji

(13/10/18 16:37), Roland Everaert wrote: Hi, We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. The file type causing this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.
Re: ExtractRequestHandler, skipping errors
Dont, commons compress 1.5 is broken, either use 1.4.1 or later. Our app stopped compressing properly for a maven update. Guido. On 18/10/13 12:40, Roland Everaert wrote: I will open a JIRA issue, I suppose that I just have to create an account first? Regards, Roland. On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi, I think the flag cannot ignore NoSuchMethodError. There may be something wrong here? ... I've just checked my Solr 4.5 directories and I found Tika version is 1.4. Tika 1.4 seems to use commons compress 1.5: http://svn.apache.org/viewvc/**tika/tags/1.4/tika-parsers/** pom.xml?view=markuphttp://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup But I see commons-compress-1.4.1.jar in solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get commons compress 1.5 and put it to the directory (don't forget to remove 1.4.1 jar file). koji (13/10/18 16:37), Roland Everaert wrote: Hi, We already configure the extractrequesthandler to ignore tika exceptions, but it is solr that complains. The customer manage to reproduce the problem. Following is the error from the solr.log. The file type cause this exception was WMZ. It seems that something is missing in a solr class. We use SOLR 4.4. 
ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.**SolrException; null:java.lang.**RuntimeException: java.lang.NoSuchMethodError: org.apache.commons.compress.**compressors.**CompressorStreamFactory.** setDecompressConcatenated(Z)V at org.apache.solr.servlet.**SolrDispatchFilter.sendError(** SolrDispatchFilter.java:673) at org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** SolrDispatchFilter.java:383) at org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** SolrDispatchFilter.java:158) at org.apache.catalina.core.**ApplicationFilterChain.**internalDoFilter(** ApplicationFilterChain.java:**243) at org.apache.catalina.core.**ApplicationFilterChain.**doFilter(** ApplicationFilterChain.java:**210) at org.apache.catalina.core.**StandardWrapperValve.invoke(** StandardWrapperValve.java:222) at org.apache.catalina.core.**StandardContextValve.invoke(** StandardContextValve.java:123) at org.apache.catalina.core.**StandardHostValve.invoke(** StandardHostValve.java:171) at org.apache.catalina.valves.**ErrorReportValve.invoke(** ErrorReportValve.java:99) at org.apache.catalina.valves.**AccessLogValve.invoke(** AccessLogValve.java:953) at org.apache.catalina.core.**StandardEngineValve.invoke(** StandardEngineValve.java:118) at org.apache.catalina.connector.**CoyoteAdapter.service(** CoyoteAdapter.java:408) at org.apache.coyote.http11.**AbstractHttp11Processor.**process(** AbstractHttp11Processor.java:**1023) at org.apache.coyote.**AbstractProtocol$**AbstractConnectionHandler.** process(AbstractProtocol.java:**589) at org.apache.tomcat.util.net.**AprEndpoint$SocketProcessor.** run(AprEndpoint.java:1852) at java.util.concurrent.**ThreadPoolExecutor.runWorker(**Unknown Source) at java.util.concurrent.**ThreadPoolExecutor$Worker.run(**Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.**compressors.**CompressorStreamFactory.** setDecompressConcatenated(Z)V at 
    at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:102)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
    ... 16 more
On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi Roland, (13/10/17 20:44), Roland Everaert wrote: Hi, I helped a customer deploy Solr + ManifoldCF and everything is going quite smoothly, but every time Solr raises an exception, the ManifoldCF job feeding Solr aborts. I would like to know if it is possible to configure the ExtractingRequestHandler to ignore errors, like it seems to be possible with the DataImportHandler and entity processors. I know that it is possible to configure the
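For reference, the "ignore Tika exceptions" configuration mentioned in this thread goes into the ExtractingRequestHandler definition in solrconfig.xml. A sketch (handler name/path and field mapping are typical defaults, adjust to your setup; note that, as discussed above, this flag swallows Tika parse exceptions but cannot catch a NoSuchMethodError from a jar mismatch):

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- swallow Tika parse errors instead of failing the whole update -->
    <str name="ignoreTikaException">true</str>
    <!-- map Tika's extracted body into the index's text field -->
    <str name="fmap.content">text</str>
  </lst>
</requestHandler>
```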
Facet performance
I am working with Solr facet fields and have come across a performance problem I don't understand. Consider these two queries: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 The only difference is an empty facet.prefix in the first query. The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this? And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT. This is with Solr 1.4.
Re: feedback on Solr 4.x LotsOfCores feature
15K cores is around 4 minutes: no network drive, just a spinning disk. But, one important thing: to simulate a cold start or a useless Linux buffer cache, I used the following command to empty the Linux buffer cache: sync; echo 3 > /proc/sys/vm/drop_caches Then I started Solr and I found the result above. On 11/10/2013 13:06, Erick Erickson wrote: bq: sharing the underlying solrconfig object the configset introduced in JIRA SOLR-4478 seems to be the solution for non-SolrCloud mode SOLR-4478 will NOT share the underlying config objects, it simply shares the underlying directory. Each core will, at least as presently envisioned, simply read the files that exist there and create its own solrconfig object. Schema objects may be shared, but not config objects. It may turn out to be relatively easy to do in the configset situation, but last time I looked at sharing the underlying config object it was too fraught with problems. bq: 15K cores is around 4 minutes I find this very odd. On my laptop, spinning disk, I think I was seeing 1K cores discovered/sec. You're seeing roughly 16x slower, so I have no idea what's going on here. If this is just reading the files, you should be seeing horrible disk contention. Are you on some kind of networked drive? bq: To do that in background and to block on that request until core discovery is complete, should not work for us (due to the worst case). What other choices are there? Either you have to do it up front or with some kind of blocking. Hmmm, I suppose you could keep some kind of custom store (DB? File? ZooKeeper?) that would keep the last known layout. You'd still have some kind of worst-case situation where the core you were trying to load wouldn't be in your persistent store and you'd _still_ have to wait for the discovery process to complete. bq: and we will use the cores Auto option to create load or only load the core on Interesting.
I can see how this could all work without any core discovery but it does require a very specific setup. On Thu, Oct 10, 2013 at 11:42 AM, Soyez Olivier olivier.so...@worldline.com wrote: The corresponding patch for Solr 4.2.1 LotsOfCores can be found in SOLR-5316, including the new Cores options: - numBuckets to create a subdirectory based on a hash on the corename % numBuckets in the core dataDir - Auto with 3 different values: 1) false: default behaviour 2) createLoad: create, if it does not exist, and load the core on the fly on the first incoming request (update, select) 3) onlyLoad: load the core on the fly on the first incoming request (update, select), if it exists on disk Concerning: - sharing the underlying solrconfig object: the configset introduced in JIRA SOLR-4478 seems to be the solution for non-SolrCloud mode. We need to test it for our use case. If another solution exists, please tell me. We are very interested in such functionality and in contributing, if we can. - the possibility of lotsOfCores in SolrCloud: we don't know in detail how SolrCloud works. But one possible limit is the maximum number of entries that can be added to a ZooKeeper node. Maybe a solution would be just a kind of hashing in the ZooKeeper tree. - the time to discover cores in Solr 4.4: with a spinning disk under Linux, all cores with transient=true and loadOnStartup=false, and the Linux buffer cache empty before starting Solr: 15K cores takes around 4 minutes. It's linear in the number of cores, so for 50K it's more than 13 minutes. In fact, it corresponds to the time needed to read all the core.properties files. Doing that in the background and blocking on a request until core discovery is complete would not work for us (due to the worst case). So, we will just disable core discovery, because we don't need to know all cores from the start.
Start Solr without any core entries in solr.xml, and we will use the cores Auto option to create and load, or only load, the core on the fly, based on the existence of the core on the disk (absolute path calculated from the core name). Thanks for your interest, Olivier From: Erick Erickson [erickerick...@gmail.com] Sent: Monday, 7 October 2013 14:33 To: solr-user@lucene.apache.org Subject: Re: feedback on Solr 4.x LotsOfCores feature Thanks for the great writeup! It's always interesting to see how a feature plays out in the real world. A couple of questions though: bq: We added 2 Cores options: Do you mean you patched Solr? If so, are you willing to share the code back? If both are yes, please open a JIRA, attach the patch and assign it to me. bq: the number of file descriptors, it used a lot (need to increase global max and per process fd) Right, this makes sense since you have a bunch of cores all with their own descriptors open. I'm assuming that you hit a rather high max number and
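The numBuckets option described in this thread maps each core to a subdirectory via a hash of the core name modulo the bucket count, so no single directory has to hold tens of thousands of core dirs. A minimal sketch of that bucketing idea (the hash function and path layout here are illustrative assumptions, not the actual SOLR-5316 code):

```python
# Toy sketch of LotsOfCores-style bucketing: spread N cores over
# numBuckets subdirectories under the data root.
import hashlib

def bucket_path(data_root: str, core_name: str, num_buckets: int = 256) -> str:
    # Use a stable hash (Python's built-in hash() is salted per process)
    h = int(hashlib.md5(core_name.encode("utf-8")).hexdigest(), 16)
    return f"{data_root}/{h % num_buckets}/{core_name}"

# The same core name always lands in the same bucket directory.
print(bucket_path("/var/solr/cores", "customer_00042"))
```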
Re: Proximity search with wildcard
Generally in Solr, if we give "Company engage"~5 it will return results containing "engage" within 5 words of "Company". So here I want to get the same kind of results if I give the query with a wildcard, such as "Compa* engage"~5 - Sayeed -- View this message in context: http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285p4096354.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filter cache pollution during sharded edismax queries
Hi Ken, Have you managed to find out why these entries were stored in the filterCache and if they have an impact on the hit ratio? We noticed the same problem; there are entries of this type: item_+(+(title:western^10.0 | ... in our filterCache. Thanks, Anca On 07/02/2013 09:01 PM, Ken Krugler wrote: Hi all, After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had dropped significantly. Previously it was at 95+%, but now it's 50%. I enabled recording 100 entries for debugging, and in looking at them it seems that edismax (and faceting) is creating entries for me. This is in a sharded setup, so it's a distributed search. If I do a search for the string bogus text using edismax on two fields, I get an entry in each of the shard's filter caches that looks like: item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2): Is this expected? I have a similar situation happening during faceted search, even though my fields are single-value/untokenized strings, and I'm not using the enum facet method. But I'll get many, many entries in the filterCache for facet values, and they all look like item_facet field:facet value: The net result of the above is that even with a very big filterCache size of 2K, the hit ratio is still only 60%. Thanks for any insights, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra, Solr
Re: Concurent indexing
Erick, yes. Using SolrJ and CloudSolrServer - both 4.6 snapshots from 13 Oct On 18 October 2013 12:17, Erick Erickson erickerick...@gmail.com wrote: Chris: OK, one of those stack traces does have the problem I referenced in the other thread. Are you sending updates to the server with SolrJ? And are you using CloudSolrServer? If you are, I'm surprised... Here are the important lines: 1. - java.util.concurrent.Semaphore.acquire() @bci=5, line=317 (Compiled frame) 2. - org.apache.solr.util.AdjustableSemaphore.acquire() @bci=4, line=61 (Compiled frame) 3. - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.update.SolrCmdDistributor$Request) @bci=22, line=418 (Compiled frame) 4. - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.client.solrj.request.UpdateRequest, On Wed, Oct 16, 2013 at 2:04 PM, Chris Geeringh geeri...@gmail.com wrote: Here's another jstack http://pastebin.com/8JiQc3rb On 16 October 2013 11:53, Chris Geeringh geeri...@gmail.com wrote: Hi Erick, here is a paste from the other thread (debugging update request) with my input, as I am seeing errors too: I ran an import last night, and this morning my cloud wouldn't accept updates. I'm running the latest 4.6 snapshot. I was importing with the latest SolrJ snapshot, and using javabin transport with CloudSolrServer. The cluster had indexed ~1.3 million docs before no further updates were accepted; querying still worked. I'll run jstack shortly and provide the results. Here is my jstack output... Lots of blocked threads. http://pastebin.com/1ktjBYbf On 16 October 2013 11:46, Erick Erickson erickerick...@gmail.com wrote: Run jstack on the Solr process (standard with Java) and look for the word semaphore. You should see your servers blocked on this in the Solr code. That'll pretty much nail it. There's an open JIRA to fix the underlying cause, see SOLR-5232, but that's currently slated for 4.6 which won't be cut for a while.
Also, there's a patch that will fix this as a side effect, assuming you're using SolrJ: see SOLR-4816. This is available in 4.5. Best, Erick On Tue, Oct 15, 2013 at 1:33 PM, michael.boom my_sky...@yahoo.com wrote: Here's some of Solr's last words (log content before it stopped accepting updates), maybe someone can help me interpret that. http://pastebin.com/mv7fH62H -- View this message in context: http://lucene.472066.n3.nabble.com/Concurent-indexing-tp4095409p4095642.html Sent from the Solr - User mailing list archive at Nabble.com.
querying nested entity fields
Hi, can someone help me determine if the query below is possible? Schema: <tag> <category>A <product>product1</product> <product>product2</product> </category> <category>B <product>product12</product> <product>product23</product> </category> </tag> Is it possible to query like this: q=tag.category:A AND tag.category.product=product1 ??? -- View this message in context: http://lucene.472066.n3.nabble.com/querying-nested-entity-fields-tp4096382.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Facet performance
Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 The only difference is an empty facet.prefix in the first query. The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this? If your index was just opened when you issued your queries, the first request will be notably slower than the second as the facet values might not be in the disk cache. Furthermore, for enum the difference between no prefix and some prefix is huge. As enum iterates values first (as opposed to fc, which iterates hits first), limiting to only the values that start with 'a' ought to speed up retrieval by a factor of 10 or more. And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT. An internal memory structure optimization in Solr limits the number of possible unique values when using fc. It is not a bug as such, but more a consequence of a choice. Unfortunately the enum solution is normally quite slow when there are enough unique values to trigger the 'too many values' exception. I know too little about the structures for DocValues to say if they will help here, but you might want to take a look at those. - Toke Eskildsen
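Toke's point about enum iterating values first can be illustrated with a toy model (pure illustration, not Solr's actual code): with a prefix, the range of the sorted term dictionary that must be scanned shrinks before any per-term counting happens.

```python
# Toy model of facet.method=enum: walk a sorted term dictionary and count
# docs per term. A facet.prefix lets us binary-search to the prefix range
# and skip every term outside it entirely.
from bisect import bisect_left, bisect_right

term_index = sorted(["aardvark", "apple", "axe", "banana", "cherry", "zebra"])

def enum_facet_terms(prefix: str):
    # Restrict the scan to [prefix, prefix + U+FFFF) via binary search
    lo = bisect_left(term_index, prefix)
    hi = bisect_right(term_index, prefix + "\uffff") if prefix else len(term_index)
    return term_index[lo:hi]

print(len(enum_facet_terms("")))   # empty prefix: every term must be visited
print(len(enum_facet_terms("a")))  # prefix 'a': only the 'a...' terms
```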
Re: solrconfig.xml carrot2 params
Thanks, I'm new to the clustering libraries. I finally made this connection when I started browsing through the carrot2 source. I had pulled down a smaller MM document collection from our test environment. It was not ideal as it was mostly structured, but small. I foolishly thought I could cluster on the text copy field before realizing that it was index only. Doh! Our documents are indexed in SolrCloud, but stored in HBase. I want to allow users to page through Solr hits, but would like to cluster on all (or at least several thousand) of the top search hits. Now I'm puzzling over how to efficiently cluster over possibly several thousand Solr hits when the documents are in HBase. I thought of an HBase coprocessor, but carrot2 isn't designed for distributed computation. Mahout, in the Hadoop M/R context, seems slow and heavy-handed for this scale; maybe I just need to dig deeper into their library. Or I could just be missing something fundamental? :) -----Original Message----- From: Stanislaw Osinski stanislaw.osin...@carrotsearch.com Sent: Friday, October 18, 2013 5:04am To: solr-user@lucene.apache.org Subject: Re: solrconfig.xml carrot2 params Hi, Out of curiosity -- what would you like to achieve by changing Tokenizer.documentFields? If you want to have clustering applied to more than one document field, you can provide a comma-separated list of fields in the carrot.title and/or carrot.snippet parameters. Thanks, Staszek -- Stanislaw Osinski, stanislaw.osin...@carrotsearch.com http://carrotsearch.com On Thu, Oct 17, 2013 at 11:49 PM, youknow...@heroicefforts.net youknow...@heroicefforts.net wrote: Would someone help me out with the syntax for setting Tokenizer.documentFields in the ClusteringComponent engine definition in solrconfig.xml? Carrot2 is expecting a Collection of Strings. There's no schema definition for this XML file and a big TODO on the Wiki wrt init params. Every permutation I have tried results in an error stating: Cannot set java.util.Collection field ...
to java.lang.String. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
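For reference, the carrot.title/carrot.snippet parameters Staszek mentions live in the clustering engine definition in solrconfig.xml. A rough sketch (field names are placeholders; check your own schema and the clustering contrib docs for the exact component setup):

```xml
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <!-- comma-separated lists of fields to feed the clustering algorithm -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">title,body</str>
  </lst>
</searchComponent>
```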
Re: how to retireve content page in solr
Hi Harshvardhan Ojha, I'm using Nutch 1.1 and Solr 3.6.0. I mean the whole document. I am trying to create a search engine with Nutch and Solr, and I would like an interface like this: name1 http://www.prova.com/name1.html first rows of content document name2 http://www.prova.com/name2.html first rows of content document name3 http://www.prova.com/name3.html first rows of content document Any ideas? Thanks Danilo -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302p4096333.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr timeout after reboot
I have a SolrCloud environment with 4 shards, each having a replica and a leader. The index size is about 70M docs and 60GB, running with Jetty + ZooKeeper, on 2 EC2 instances, each with 4 CPUs and 15GB RAM. I'm using SolrMeter for stress testing. If I restart Jetty and then try to use SolrMeter to bomb an instance with queries, using a query-per-minute rate of 3000, then that Solr instance somehow times out and I need to restart it again. If instead of using 3000 qpm I start up slowly with 200 for a minute or two, then 1800 and then 3000, everything is good. I assume this happens because Solr is not warmed up. What settings could I tweak so that Solr doesn't time out anymore when getting many requests? Is there a way to limit how many requests it can serve? - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408.html Sent from the Solr - User mailing list archive at Nabble.com.
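One common answer to this kind of cold-start collapse is static warming in solrconfig.xml, so caches are primed before the first real query hits a freshly opened searcher. A sketch (the queries themselves are placeholders; use queries representative of your real traffic, including your typical sorts and facets):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- run representative queries before the searcher serves traffic -->
    <lst><str name="q">*:*</str><str name="rows">10</str></lst>
    <lst><str name="q">popular term</str><str name="facet">true</str></lst>
  </arr>
</listener>
```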
Fwd: Searching within list of regions with 1:1 document-region mapping
Hi, I have a Solr index of around 100 million documents, with each document being given a region id, growing at a rate of about 10 million documents per month - the average document size being around 10KB of pure text. The total number of region ids is itself in the range of 2.5 million. I want to search for a query within a given list of region ids. The number of region ids in this list is usually around 250-300 (most of the time), but can be up to 500, with a maximum cap of around 2000 ids in one request. What is the best way to model such queries, besides using an IN-style param in the query, a filter (fq) in the query, or some other means? If it may help, the index is on a VM with 4 virtual cores and currently has 4GB of Java memory allocated out of the 16GB in the machine. The number of queries does not exceed 1 per minute for now. If needed, we can throw more hardware at the index - but the index will still be only on a single machine for at least 6 months. Best Regards, Sandeep Gupta
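One straightforward way to express such a restriction is a boolean filter query over the region field, which Solr caches independently of the main query (field name and ids below are illustrative):

```text
q=user+query&fq=region_id:(17 42 93 1024)
```

With several hundred ids the fq gets long, and with the larger lists you may bump into the maxBooleanClauses limit in solrconfig.xml (default 1024), which would then need raising. Whether this stays fast at ~2000 ids is worth benchmarking against your own index.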
RE: Facet performance
Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 The only difference is an empty facet.prefix in the first query. The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this? If your index was just opened when you issued your queries, the first request will be notably slower than the second as the facet values might not be in the disk cache. I know, but it shouldn't be orders of magnitude as in this example, should it? Furthermore, for enum the difference between no prefix and some prefix is huge. As enum iterates values first (as opposed to fc, which iterates hits first), limiting to only the values that start with 'a' ought to speed up retrieval by a factor of 10 or more. Thanks. That is what we sort of figured, but it's good to know for sure. Of course it begs the question whether there is a way to speed this up? And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT. An internal memory structure optimization in Solr limits the number of possible unique values when using fc. It is not a bug as such, but more a consequence of a choice. Unfortunately the enum solution is normally quite slow when there are enough unique values to trigger the 'too many values' exception. I know too little about the structures for DocValues to say if they will help here, but you might want to take a look at those. What is DocValues? I haven't heard of it yet. And yes, the fc method was terribly slow in a case where it did work. Something like 20 minutes, whereas enum returned within a few seconds. Michael
Re: Check if dynamic columns exists and query else ignore
Bumping this one, any suggestions? Looks like if() and exists() are meant to solve this problem, but I am using them in a wrong way. -Utkarsh On Thu, Oct 17, 2013 at 1:16 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am trying to do this: if (US_offers_i exists): fq=US_offers_i:[1 TO *] else: fq=offers_count:[1 TO *] Where: US_offers_i is a dynamic field containing an int, and offers_count is a status field containing an int. I have tried this so far but it doesn't work: http://solr_server/solr/col1/select? q=iphone+5s fq=if(exist(US_offers_i),US_offers_i:[1 TO *], offers_count:[1 TO *]) Also, is there a heavy performance penalty for this condition? I am planning to use this for all my queries. -- Thanks, -Utkarsh
Issues with Language detection in Solr
Hi All, I am trying to detect the language of the business name field and the address field. I am using Solr's LangDetect (the Google library), not Tika. It works OK in most cases, but in some it detects the language wrongly. For example, the document - OrgName: EXPLOITS VALLEY HIGHGREENWOOD, StreetLine1: 19 GREENWOOD AVE, StreetLine2: , SOrgName: EXPLOITS VALLEY HIGHGREENWOOD, StandardizedStreetLine1: 19 GREENWOOD AVE, language_s: [de] - the language is detected as German (de) here, which is wrong. My configuration is: fields = OrgName,StreetLine1,StreetLine2,SOrgName,StandardizedStreetLine1; language field = language_s; threshold = 0.9; fallback = en. Why is there an issue? Why is the language detection wrong? Please help! Vibhor -- View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Issues with Language detection in Solr
I would say that in general you need at least 15 or 20 words in a text field for language to be detected reasonably well. Sure, sometimes it can work for 8 to 12 words, but it's a coin flip how reliable it will be. You haven't shown us any true text fields. I would say that language detection against simple name fields is a misuse of the language detection feature. I mean, it is designed for larger blocks of text, not very short phrases. See some examples in my e-book. -- Jack Krupansky -----Original Message----- From: vibhoreng04 Sent: Friday, October 18, 2013 2:01 PM To: solr-user@lucene.apache.org Subject: Issues with Language detection in Solr Hi All, I am trying to detect the language of the business name field and the address field. I am using Solr's LangDetect (the Google library), not Tika. It works OK in most cases, but in some it detects the language wrongly. For example, the document - OrgName: EXPLOITS VALLEY HIGHGREENWOOD, StreetLine1: 19 GREENWOOD AVE, StreetLine2: , SOrgName: EXPLOITS VALLEY HIGHGREENWOOD, StandardizedStreetLine1: 19 GREENWOOD AVE, language_s: [de] - the language is detected as German (de) here, which is wrong. My configuration is: fields = OrgName,StreetLine1,StreetLine2,SOrgName,StandardizedStreetLine1; language field = language_s; threshold = 0.9; fallback = en. Why is there an issue? Why is the language detection wrong? Please help! Vibhor -- View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433.html Sent from the Solr - User mailing list archive at Nabble.com.
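Jack's point - that a detector's guess on a handful of short, all-caps address tokens is essentially noise - is exactly what the threshold/fallback pattern in the configuration above is for. A toy illustration of that logic (the detector here is a stand-in stub, not Solr's or Google's actual detector; it only shows how a confidence threshold plus a fallback language behaves):

```python
# Toy illustration of threshold-plus-fallback logic, similar in spirit to
# langid-style settings: when the detector's confidence on a field is below
# the threshold, fall back rather than trust a shaky guess like "de".
def pick_language(detected: str, confidence: float,
                  threshold: float = 0.9, fallback: str = "en") -> str:
    # A short all-caps name line gives a detector little evidence, so its
    # low-confidence guess is discarded in favour of the fallback.
    return detected if confidence >= threshold else fallback

print(pick_language("de", 0.55))  # weak evidence on a short field
print(pick_language("fr", 0.97))  # strong evidence on a long paragraph
```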
Seeking New Moderators for solr-user@lucene
It looks like it's time to inject some fresh blood into the solr-user@lucene moderation team. If you'd like to volunteer to be a moderator, please reply back to this thread and specify which email address you'd like to use as a moderator (if different from the one you use when sending the email). Being a moderator is really easy: you'll get some extra emails in your inbox with MODERATE in the subject, which you skim to see if they are spam -- if they are, you delete them; if not, you reply-all to let them get sent to the list, and authorize that person to send future messages w/o moderation. Occasionally, you'll see an explicit email to solr-user-owner@lucene from a user asking for help related to their subscription (usually unsubscribing problems), and you and the other moderators chime in with assistance when possible. More details can be found here... https://wiki.apache.org/solr/MailingListModeratorInfo (I'll wait ~72+ hours to see who responds, and then file the appropriate jira with INFRA) -Hoss
Re: Switching indexes
I was able to get the new collections working dynamically (via Collections RESTful calls). I was having some other issues with my development environment that I had to fix up to get it going. I had to upgrade to 4.5 in order for the aliases to work at all though. Not sure what the deal was with that. Thanks Shawn -- I have a much better understanding of all this now. -- Chris On Thu, Oct 17, 2013 at 7:31 PM, Shawn Heisey s...@elyograg.org wrote: On 10/17/2013 12:51 PM, Christopher Gross wrote: OK, super confused now. http://index1:8080/solr/admin/cores?action=CREATE&name=test2&collection=test2&numshards=1&replicationFactor=3 Nets me this: <response> <lst name="responseHeader"> <int name="status">400</int> <int name="QTime">15007</int> </lst> <lst name="error"> <str name="msg">Error CREATEing SolrCore 'test2': Could not find configName for collection test2 found:[xxx, xxx, , x, xx]</str> <int name="code">400</int> </lst> </response> For that node (test2), in my solr data directory, I have a folder with the conf files and an existing data dir (copied the index from another location). Right now it seems like the only way that I can add in a collection is to load the configs into zookeeper, stop tomcat, add it to the solr.xml file, and restart tomcat. The config does need to be loaded into zookeeper. That's how SolrCloud works. Because you have existing collections, you're going to have at least one config set already uploaded; you may be able to use that directly. You don't need to stop anything, though. Michael Della Bitta's response indicates the part you're missing on your create URL - the collection.configName parameter. The basic way to get things done with collections is this: 1) Upload one or more named config sets to zookeeper. This can be done with zkcli and its upconfig command, or with the bootstrap startup options that are intended to be used once.
2) Create the collection, referencing the proper collection.configName. You can have many collections that all share one config name. You can also change which config an existing collection uses with the zkcli linkconfig command, followed by a collection reload. If you upload a new configuration with an existing name, a collection reload (or Solr restart) is required to use the new config. For uploading configs, I find zkcli to be a lot cleaner than the bootstrap options - it doesn't require stopping Solr or giving it different startup options. Actually, it doesn't even require Solr to be started - it talks only to zookeeper, and we strongly recommend standalone zookeeper, not the zk server that can be run embedded in Solr. Thanks, Shawn
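Putting Shawn's two steps together, a typical sequence looks roughly like this (host names, paths, and the config name are placeholders; note the Collections API endpoint /admin/collections rather than the CoreAdmin /admin/cores used in the failing URL above):

```text
# 1) upload a named config set to ZooKeeper
zkcli.sh -cmd upconfig -zkhost zk1:2181 -confdir /path/to/conf -confname myconf

# 2) create the collection against that named config
http://index1:8080/solr/admin/collections?action=CREATE&name=test2&numShards=1&replicationFactor=3&collection.configName=myconf
```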
Re: Check if dynamic columns exists and query else ignore
: I trying to do this: : : if (US_offers_i exists): :fq=US_offers_i:[1 TO *] : else: :fq=offers_count:[1 TO *] if() and exist() are functions, so you would have to explicitly use them in a function context (ie: {!func} parser, or {!frange} parser), and to use those nested queries inside of functions you'd need to use the query() function. But nothing about your problem description suggests that you really need to worry about this. If a document doesn't contain US_offers_i then US_offers_i:[1 TO *] won't match that document, and neither will US_offers_i:[* TO *] -- so you can implement the logic you describe with a simple query... fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *])) Which you can read as: Match docs with 1 or more US offers, or docs that have 1 or more offers but no US offer field at all. : Also, there is a heavy performance penalty for this condition? I am : planning to use this for all my queries. Any logic that you do at query time which can be precomputed into a specific field in your index will *always* make the queries faster (at the expense of a little more time spent indexing and a little more disk used). If you know in advance that you are frequently going to want to restrict on this type of logic, then unless you index docs more often than you search them, you should almost certainly index a has_offers boolean field that captures this logic. -Hoss
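Hoss's filter query can be sanity-checked with a tiny model of its matching logic (illustration only; field names are the ones from the thread):

```python
# Model of fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *]))
# A doc matches if US_offers_i >= 1, OR it has no US_offers_i field at all
# but offers_count >= 1. Note: a doc with US_offers_i=0 matches neither clause.
def matches(doc: dict) -> bool:
    if "US_offers_i" in doc:
        return doc["US_offers_i"] >= 1
    return doc.get("offers_count", 0) >= 1

docs = [
    {"US_offers_i": 3, "offers_count": 0},  # matches: has US offers
    {"US_offers_i": 0, "offers_count": 9},  # no match: US field present but 0
    {"offers_count": 2},                    # matches: falls back to offers_count
    {"offers_count": 0},                    # no match
]
print([matches(d) for d in docs])
```

The second doc is the interesting edge case: because the US field is *present* (just zero), the fallback clause is excluded by -US_offers_i:[* TO *], matching the "else" semantics the original poster asked for.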
Re: Issues with Language detection in Solr
I agree with you, Jack. But I request you to see here that this filter otherwise works perfectly fine. Only in one case, where even all the words are Latin, is the language detected as German. My question is why and how? If it works perfectly for the other docs, what in this case is making it behave abnormally? -- View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433p4096443.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Seeking New Moderators for solr-user@lucene
Hey Hoss, I'd be happy to moderate. Sent from my iPhone On 19-Oct-2013, at 0:22, Chris Hostetter hossman_luc...@fucit.org wrote: It looks like it's time to inject some fresh blood into the solr-user@lucene moderation team. If you'd like to volunteer to be a moderator, please reply back to this thread and specify which email address you'd like to use as a moderator (if different from the one you use when sending the email) Being a moderator is really easy: you'll get a some extra emails in your inbox with MODERATE in the subject, which you skim to see if they are spam -- if they are you delete them, if not you reply all to let them get sent to the list, and authorize that person to send future messages w/o moderation. Occasionally, you'll see an explicit email to solr-user-owner@lucene from a user asking for help realted to their subscription (usually unsubscribing problems) and you and the other moderators chime in with assistance when possible. More details can be found here... https://wiki.apache.org/solr/MailingListModeratorInfo (I'll wait ~72+ hours to see who responds, and then file the appropriate jira with INFRA) -Hoss
Re: Questions developing custom functionquery
: Field-Type: org.apache.solr.schema.TextField ... : DocTermsIndexDocValues (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-queries/4.3.0/org/apache/lucene/queries/function/docvalues/DocTermsIndexDocValues.java) : Calling getVal() on a DocTermsIndexDocValues does some really weird stuff : that I really don't understand. Your TextField is being analyzed in some way you haven't clarified, and the DocTermsIndexDocValues you get contains the details of each term in that TextField : Its possible I'm going about this wrong and need to re-do my approach. I'm : just currently at a loss for what that approach is. Based on your initial goal, you are most certainly going about this in a much more complicated way than you need to... :My goal is to be able to implement a custom sorting technique. :Example: <str name="resname">/some/example/data/here/2013/09/12/testing.text</str> : :I would like to do a custom sort based on this resname field. :Basically, I would like to parse out that date there (2013/09/12) and : sort :on that date. You are going to be *MUCH* happier (both in terms of effort, and in terms of performance) if, instead of writing a custom function to parse strings at query time when sorting, you implement the parsing logic when indexing the doc and index it up front as a date field that you can sort on. I would suggest something like CloneFieldUpdateProcessorFactory + RegexReplaceProcessorFactory could save you the work of needing to implement any custom logic -- but as Jack pointed out in SOLR-4864 it doesn't currently allow you to do capture group replacements (but maybe you could contribute a patch to fix that instead of needing to write completely custom code for yourself). Or maybe, as is, you could use RegexReplaceProcessorFactory to throw away non-digits - and then use ParseDateFieldUpdateProcessorFactory to get what you want?
(I'm not certain - I haven't played with ParseDateFieldUpdateProcessorFactory much) https://issues.apache.org/jira/browse/SOLR-4864 https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html -Hoss
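For what it's worth, Hoss's clone-then-parse idea could be wired up as an update processor chain in solrconfig.xml. The sketch below is untested; the chain name, the resname_dt field, and the regex are assumptions, and stripping all non-digits only yields a clean yyyyMMdd value if the rest of the path contains no other digits:

```xml
<!-- Hypothetical chain: copy resname, strip non-digits, parse as a date. -->
<updateRequestProcessorChain name="parse-resname-date">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">resname</str>
    <str name="dest">resname_dt</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">resname_dt</str>
    <str name="pattern">[^0-9]</str>
    <str name="replacement"></str>
  </processor>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <str name="fieldName">resname_dt</str>
    <arr name="format">
      <str>yyyyMMdd</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then be selected per request with update.chain=parse-resname-date (or made the default on the update handler), with resname_dt declared as a sortable date field in schema.xml.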
Re: Seeking New Moderators for solr-user@lucene
Hi Chris, I would like to moderate and you can use the mail id vibhoren...@gmail.com for this purpose . Regards, Vibhor Jaiswal -- View this message in context: http://lucene.472066.n3.nabble.com/Seeking-New-Moderators-for-solr-user-lucene-tp4096447p4096448.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Seeking New Moderators for solr-user@lucene
Hello! I can help with moderation. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch It looks like it's time to inject some fresh blood into the solr-user@lucene moderation team. If you'd like to volunteer to be a moderator, please reply back to this thread and specify which email address you'd like to use as a moderator (if different from the one you use when sending the email) Being a moderator is really easy: you'll get some extra emails in your inbox with MODERATE in the subject, which you skim to see if they are spam -- if they are you delete them, if not you reply all to let them get sent to the list, and authorize that person to send future messages w/o moderation. Occasionally, you'll see an explicit email to solr-user-owner@lucene from a user asking for help related to their subscription (usually unsubscribing problems) and you and the other moderators chime in with assistance when possible. More details can be found here... https://wiki.apache.org/solr/MailingListModeratorInfo (I'll wait ~72+ hours to see who responds, and then file the appropriate jira with INFRA) -Hoss
Re: Facet performance
DocValues is the new black http://wiki.apache.org/solr/DocValues Otis -- Solr ElasticSearch Support -- http://sematext.com/ SOLR Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 12:30 PM, Lemke, Michael SZ/HZA-ZSW lemke...@schaeffler.com wrote: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 The only difference is an empty facet.prefix in the first query. The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this? If your index was just opened when you issued your queries, the first request will be notably slower than the second as the facet values might not be in the disk cache. I know but it shouldn't be orders of magnitude as in this example, should it? Furthermore, for enum the difference between no prefix and some prefix is huge. As enum iterates values first (as opposed to fc, which iterates hits first), limiting to only the values that start with 'a' ought to speed up retrieval by a factor of 10 or more. Thanks. That is what we sort of figured but it's good to know for sure. Of course it begs the question if there is a way to speed this up? And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT. An internal memory structure optimization in Solr limits the number of possible unique values when using fc. It is not a bug as such, but more a consequence of a choice. Unfortunately the enum solution is normally quite slow when there are enough unique values to trigger the 'too many values' exception. 
I know too little about the structures for DocValues to say if they will help here, but you might want to take a look at those. What is DocValues? Haven't heard of it yet. And yes, the fc method was terribly slow in a case where it did work. Something like 20 minutes whereas enum returned within a few seconds. Michael
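For reference, since the question "What is DocValues?" came up: DocValues (available since Solr 4.2) are enabled per field in schema.xml. An untested sketch; note that docValues requires a non-tokenized type, so an analyzed CONTENT field would need a string-typed copy, and whether this helps at very high cardinality is exactly the open question in this thread:

```xml
<field name="CONTENT_dv" type="string" indexed="true" stored="false" docValues="true"/>
<copyField source="CONTENT" dest="CONTENT_dv"/>
```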
RE: Facet performance
: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 : 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 : : The only difference is an empty facet.prefix in the first query. : If your index was just opened when you issued your queries, the first : request will be notably slower than the second as the facet values might : not be in the disk cache. : : I know but it shouldn't be orders of magnitude as in this example, should it? in and of itself: it can be if your index is large enough and none of the disk pages are in the file system buffer. More significantly, however, depending on how big your filterCache is, the first request could easily be caching all of the filters needed for the second query -- at a minimum it's definitely caching your main query, which will be re-used and save a lot of time independent of the faceting. -Hoss
SOLRJ replace document
How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless <updateLog/> is configured I don't want to do partial updates, I just want to replace it... Thanks, Brent
Re: Check if dynamic columns exists and query else ignore
Thanks Chris! That worked! I overengineered my query! Thanks, -Utkarsh On Fri, Oct 18, 2013 at 12:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I'm trying to do this: : : if (US_offers_i exists): :fq=US_offers_i:[1 TO *] : else: :fq=offers_count:[1 TO *] if() and exists() are functions, so you would have to explicitly use them in a function context (i.e. {!func} parser, or {!frange} parser) and to use those nested queries inside of functions you'd need to use the query() function. but nothing about your problem description suggests that you really need to worry about this. If a document doesn't contain the US_offers_i field, then US_offers_i:[1 TO *] won't match that document, and neither will US_offers_i:[* TO *] -- so you can implement the logic you describe with a simple query... fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *])) Which you can read as "Match docs with 1 or more US offers, or: docs that have 1 or more offers but no US offer field at all" : Also, is there a heavy performance penalty for this condition? I am : planning to use this for all my queries. Any logic that you do at query time, which can be precomputed into a specific field in your index, will *always* make the queries faster (at the expense of a little more time spent indexing and a little more disk used). If you know in advance that you are frequently going to want to restrict on this type of logic, then unless you index docs more often than you search, you should almost certainly index a has_offers boolean field that captures this logic. -Hoss -- Thanks, -Utkarsh
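For completeness, the function-based form Chris alludes to would look roughly like this, combining the if(), exists(), and {!frange} pieces he names (an untested sketch; the plain boolean fq above is simpler and likely faster since it caches as an ordinary filter):

```
fq={!frange l=1}if(exists(US_offers_i),US_offers_i,offers_count)
```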
loading djvu xml into solr
Does anyone have a schema they'd be willing to share for loading djvu xml into solr?
Re: loading djvu xml into solr
On Fri, Oct 18, 2013, at 10:11 PM, Sara Amato wrote: Does anyone have a schema they'd be willing to share for loading djvu xml into solr? I assume that djvu XML is a particular XML format? In which case, there is no schema that can do it. That's not how Solr works. You need to use the XML format expected by Solr. Or, you can add tr=.xsl to the URL, and use an XSL stylesheet to transform your XML into Solr's XML format. The schema defines the fields that are present in the index, not the format of the XML used. Upayavira
Re: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core
Hello, I still have this issue using Solr 4.4, removing firstSearcher queries did make the problem go away. Note that I'm using Tomcat 7 and that if I'm using my own Java application launching an Embedded Solr Server pointing to the same Solr configuration the server fully starts with no hang. What is the xml tag syntax to have spellcheck=false for firstSearcher discussed above? Cheers, /jonatan --- HANG with Tomcat 7 (firstSearcher queries on) --- ... 2409 [coreLoadExecutor-3-thread-3] INFO org.apache.solr.handler.component.SpellCheckComponent – No queryConverter defined, using default converter 2409 [coreLoadExecutor-3-thread-3] INFO org.apache.solr.handler.component.QueryElevationComponent – Loading QueryElevation from: /var/lib/myapp/conf/elevate.xml 2415 [coreLoadExecutor-3-thread-3] INFO org.apache.solr.handler.ReplicationHandler – Commits will be reserved for 1 2415 [searcherExecutor-16-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener sending requests to Searcher@5c43ecf0main{StandardDirectoryReader(segments_3:23 _9(4.4):C57862)} 2417 [searcherExecutor-16-thread-1] INFO org.apache.solr.core.SolrCore – [foo-20130912] webapp=null path=null params={event=firstSearcherq=static+firstSearcher+warming+in+solrconfig.xmldistrib=false} hits=0 status=0 QTime=1 2417 [searcherExecutor-16-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener done. 
2417 [searcherExecutor-16-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – Loading spell index for spellchecker: default 2417 [searcherExecutor-16-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – Loading spell index for spellchecker: wordbreak 2418 [searcherExecutor-16-thread-1] INFO org.apache.solr.core.SolrCore – [foo-20130912] Registered new searcher Searcher@5c43ecf0main{StandardDirectoryReader(segments_3:23 _9(4.4):C57862)} 2420 [coreLoadExecutor-3-thread-3] INFO org.apache.solr.core.CoreContainer – registering core: foo-20130912 --- NO HANG EmbeddedSolrServer (firstSearcher queries on) --- ... 1797 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – No queryConverter defined, using default converter 1797 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.handler.component.QueryElevationComponent – Loading QueryElevation from: /var/lib/myapp/conf/elevate.xml 1800 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.handler.ReplicationHandler – Commits will be reserved for 1 1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener sending requests to Searcher@27b104d7main{StandardDirectoryReader(segments_3:23 _9(4.4):C57862)} 1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener done. 
1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – Loading spell index for spellchecker: default 1801 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.CoreContainer – registering core: foo-20130912 1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – Loading spell index for spellchecker: wordbreak 1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.core.SolrCore – [foo-20130912] Registered new searcher Searcher@27b104d7main{StandardDirectoryReader(segments_3:23 _9(4.4):C57862)} On Fri, Sep 6, 2013 at 4:29 PM, Austin Rasmussen arasmus...@directs.com wrote: : Do all of your cores have newSearcher event listeners configured or just : 2 (I'm trying to figure out if it's a timing fluke that these two are stalled, or if it's something special about the configs) All of my cores have both the newSearcher and firstSearcher event listeners configured. (The firstSearcher actually doesn't have any queries configured against it, so it probably should just be removed altogether) : Can you try removing the newSearcher listeners to confirm that that does in fact make the problem go away? Removing the newSearcher listeners does not make the problem go away; however, removing the firstSearcher listener (even if the newSearcher listener is still configured) does make the problem go away. : With the newSearcher listeners in place, can you try setting spellcheck=false as a query param on the newSearcher listeners you have configured and : see if that works around the problem? Adding the spellcheck=false param to the firstSearcher listener does appear to work around the problem. : Assuming it's just 2 cores using these listeners: can you reproduce this problem with a simpler setup where only one of the affected cores is in use? Since it's not just these two cores, I'm not sure how to produce much of a simpler setup. 
I did attempt to limit how many cores are loaded in the solr.xml, and found that if I cut it down to 56, it was able to load successfully (without any of the above config changed). If I cut it down to 57 cores, it doesn't
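To answer the syntax question asked earlier in this thread: the spellcheck=false parameter goes on each warming query inside the firstSearcher listener in solrconfig.xml, roughly like this (untested; the query text here is just the stock example query):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">static firstSearcher warming in solrconfig.xml</str>
      <str name="spellcheck">false</str>
    </lst>
  </arr>
</listener>
```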
Re: SOLRJ replace document
To replace a Solr document, simply add it again using the same technique used to insert the original document. The set option for atomic update is only used when you wish to selectively update only some of the fields for a document, and that does require that the update log be enabled using updateLog. -- Jack Krupansky -Original Message- From: Brent Ryan Sent: Friday, October 18, 2013 4:59 PM To: solr-user@lucene.apache.org Subject: SOLRJ replace document How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless updateLog/ is configured I don't want to do partial updates, I just want to replace it... Thanks, Brent
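The distinction Jack describes shows up directly in the XML update format: a full replace is a plain add reusing the same uniqueKey, while an atomic update marks fields with an update attribute (illustrative sketch; the field names are made up):

```xml
<!-- Full replace: the old document with the same uniqueKey is deleted. -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">replacement title</field>
  </doc>
</add>

<!-- Atomic (partial) update: requires <updateLog/> in solrconfig.xml. -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title" update="set">new title only</field>
  </doc>
</add>
```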
Re: SOLRJ replace document
I wish that was the case but calling addDoc() is what's triggering that exception. On Friday, October 18, 2013, Jack Krupansky wrote: To replace a Solr document, simply add it again using the same technique used to insert the original document. The set option for atomic update is only used when you wish to selectively update only some of the fields for a document, and that does require that the update log be enabled using <updateLog/>. -- Jack Krupansky -Original Message- From: Brent Ryan Sent: Friday, October 18, 2013 4:59 PM To: solr-user@lucene.apache.org Subject: SOLRJ replace document How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless <updateLog/> is configured I don't want to do partial updates, I just want to replace it... Thanks, Brent
Re: SOLRJ replace document
On 10/18/2013 2:59 PM, Brent Ryan wrote: How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless <updateLog/> is configured I don't want to do partial updates, I just want to replace it... Replacing a document is done by simply adding the document, in the same way as if you were adding a new one. If you have properly configured Solr, the old one will be deleted before the new one is inserted. Properly configuring Solr means that you have a uniqueKey field in your schema, and that it is a simple type like string, int, long, etc, and is not multivalued. A TextField type that is tokenized cannot be used as the uniqueKey field. Thanks, Shawn
Re: loading djvu xml into solr
Ah, thanks for the clarification - I was having a serious misunderstanding! (As you can tell I'm newly off the tutorial and blundering ahead...) On Oct 18, 2013, at 2:22 PM, Upayavira wrote: On Fri, Oct 18, 2013, at 10:11 PM, Sara Amato wrote: Does anyone have a schema they'd be willing to share for loading djvu xml into solr? I assume that djvu XML is a particular XML format? In which case, there is no schema that can do it. That's not how Solr works. You need to use the XML format expected by Solr. Or, you can add tr=.xsl to the URL, and use an XSL stylesheet to transform your XML into Solr's XML format. The schema defines the fields that are present in the index, not the format of the XML used. Upayavira
Re: SOLRJ replace document
My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. Brent On Friday, October 18, 2013, Shawn Heisey wrote: On 10/18/2013 2:59 PM, Brent Ryan wrote: How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless <updateLog/> is configured I don't want to do partial updates, I just want to replace it... Replacing a document is done by simply adding the document, in the same way as if you were adding a new one. If you have properly configured Solr, the old one will be deleted before the new one is inserted. Properly configuring Solr means that you have a uniqueKey field in your schema, and that it is a simple type like string, int, long, etc, and is not multivalued. A TextField type that is tokenized cannot be used as the uniqueKey field. Thanks, Shawn
Re: Issues with Language detection in Solr
Sorry, but Latin is not on the list of supported languages: https://code.google.com/p/language-detection/wiki/LanguageList -- Jack Krupansky -Original Message- From: vibhoreng04 Sent: Friday, October 18, 2013 3:07 PM To: solr-user@lucene.apache.org Subject: Re: Issues with Language detection in Solr I agree with you, Jack. But I request you to see here that this filter still works perfectly fine. Only in one case, where even all the words are Latin, is the language getting detected as German. My question is why and how? If it works perfectly for the other docs, what in this case is making it behave abnormally? -- View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433p4096443.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLRJ replace document
On 10/18/2013 3:36 PM, Brent Ryan wrote: My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it? Thanks, Shawn
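Shawn's point, restated as an untested SolrJ sketch (this requires the SolrJ library and a running Solr instance, so it is illustrative only; the tags field and values are made up, solr_id is from this thread): a Map field value is serialized as an atomic update, while a List is an ordinary multivalued field.

```java
SolrInputDocument doc = new SolrInputDocument();
doc.addField("solr_id", "doc1");

// OK: a List is sent as a plain multivalued field
doc.addField("tags", Arrays.asList("red", "blue"));

// NOT OK for a plain add: a Map becomes an atomic update, which
// fails unless <updateLog/> is configured
// doc.addField("tags", Collections.singletonMap("set", "red"));

server.add(doc);
server.commit();
```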
Re: Seeking New Moderators for solr-user@lucene
I'll be happy to moderate. I do it for some other lists already. Regards, Alex
Leader election fails in some point.
Hi, In this screenshot I have a shard with two replicas without leader, http://picpaste.com/qf2jdkj8.png On machine with shard green I found this exception: INFO - dat5 - 2013-10-18 22:48:04.775; org.apache.solr.handler.admin.CoreAdminHandler; Going to wait for coreNodeName: 192.168.20.106:8983_solr_statistics-13_shard18_replica4, state: recovering, checkLive: true, onlyIfLeader: true ERROR - dat5 - 2013-10-18 22:48:04.775; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: We are not the leader at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:824) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:192) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:655) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:246) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) -- at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Unknown Source) On the machine with the shard in recovery state I found this exception: INFO - dat6 - 2013-10-18 22:48:44.131; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard18 INFO - dat6 - 2013-10-18 22:48:44.137; org.apache.solr.cloud.ShardLeaderElectionContext; Checking if I should try and be the leader. INFO - dat6 - 2013-10-18 22:48:44.138; org.apache.solr.cloud.ShardLeaderElectionContext; My last published State was recovering, I won't be the leader. INFO - dat6 - 2013-10-18 22:48:44.139; org.apache.solr.cloud.ShardLeaderElectionContext; There may be a better leader candidate than us - going back into recovery INFO - dat6 - 2013-10-18 22:48:44.142; org.apache.solr.update.DefaultSolrCoreState; Running recovery - first canceling any ongoing recovery WARN - dat6 - 2013-10-18 22:48:44.142; org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for zkNodeName=192.168.20.106:8983_solr_statistics-13_shard18_replica4core=statistics-13_shard18_replica4 INFO - dat6 - 2013-10-18 22:48:45.131; org.apache.solr.cloud.RecoveryStrategy; Finished recovery process. core=statistics-13_shard18_replica4 INFO - dat6 - 2013-10-18 22:48:45.131; org.apache.solr.cloud.RecoveryStrategy; Starting recovery process. 
core=statistics-13_shard18_replica4 recoveringAfterStartup=false INFO - dat6 - 2013-10-18 22:48:45.131; org.apache.solr.cloud.ZkController; publishing core=statistics-13_shard18_replica4 state=recovering INFO - dat6 - 2013-10-18 22:48:45.132; org.apache.solr.cloud.ZkController; numShards not found on descriptor - reading it from system property INFO - dat6 - 2013-10-18 22:48:45.141; org.apache.solr.client.solrj.impl.HttpClientUtil; Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false ERROR - dat6 - 2013-10-18 22:48:45.143; org.apache.solr.common.SolrException; Error while trying to recover. core=statistics-13_shard18_replica4:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: We are not the leader at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180) at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219) No leader means we can't index data because a 503 http status code is returned. Is this the normal behaviour or a bug? - Best regards
Re: Solr timeout after reboot
Michael, The servlet container controls timeouts, max threads and such. That's not a high query rate, but yes, it could be that Solr or OS caches are cold. You will be able to see all this in SPM for Solr while you hammer your poor Solr servers :) Otis Solr ElasticSearch Support http://sematext.com/ On Oct 18, 2013 11:38 AM, michael.boom my_sky...@yahoo.com wrote: I have a SolrCloud environment with 4 shards, each having a replica and a leader. The index size is about 70M docs and 60Gb, running with Jetty + Zookeeper, on 2 EC2 instances, each with 4 CPUs and 15G RAM. I'm using SolrMeter for stress testing. If I restart Jetty and then try to use SolrMeter to bomb an instance with queries, using a query-per-minute rate of 3000, then that solr instance somehow times out and I need to restart it again. If instead of using 3000 qpm I start up slowly with 200 for a minute or two, then 1800 and then 3000, everything is good. I assume this happens because Solr is not warmed up. What settings could I tweak so that Solr doesn't time out anymore when getting many requests? Is there a way to limit how many req it can serve? - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408.html Sent from the Solr - User mailing list archive at Nabble.com.
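One common mitigation for this cold-start pattern is cache autowarming plus newSearcher warming queries in solrconfig.xml; a rough, untested sketch (the cache sizes and the query are placeholders to adapt to your own traffic):

```xml
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">a typical production query</str></lst>
  </arr>
</listener>
```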
Re: XLSB files not indexed
Hi Roland, It looks like: Tika - yes Solr - no? Based on http://search-lucene.com/?q=xlsb ODF != XLSB though, I think... Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 7:36 AM, Roland Everaert reveatw...@gmail.com wrote: Hi, Can someone tells me if tika is supposed to extract data from xlsb files (the new MS Office format in binary form)? If so then it seems that solr is not able to index them like it is not able to index ODF files (a JIRA is already opened for ODF https://issues.apache.org/jira/browse/SOLR-4809) Can someone confirm the problem, or tell me what to do to make solr works with XLSB files. Regards, Roland.
Re: SolrCloud Performance Issue
Hi, What happens if you have just 1 shard - no distributed search, like before? SPM for Solr or any other monitoring tool that captures OS and Solr metrics should help you find the source of the problem faster. Is disk IO the same? utilization of caches? JVM version, heap, etc.? CPU usage? network? I'd look at each of these things side by side and look for big differences. Otis -- Solr ElasticSearch Support -- http://sematext.com/ SOLR Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 1:38 AM, shamik sham...@gmail.com wrote: I tried commenting out NOW in bq, but didn't make any difference in the performance. I do see minor entry in the queryfiltercache rate which is a meager 0.02. I'm really struggling to figure out the bottleneck, any known pain points I should be checking ? -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Performance-Issue-tp4095971p4096277.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLRJ replace document
So I think the issue might be related to the tech stack we're using, which is SOLR within DataStax Enterprise, which doesn't support atomic updates. But I think it must have some sort of bug around this because it doesn't appear to work correctly for this use case when using solrj ... Anyways, I've contacted support so let's see what they say. On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey s...@elyograg.org wrote: On 10/18/2013 3:36 PM, Brent Ryan wrote: My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it? Thanks, Shawn
Re: SOLRJ replace document
Keep in mind that DataStax has a custom update handler, and as such isn't exactly a vanilla Solr implementation (even though in many ways it still is). Since updates are co-written to Cassandra and Solr you should always tread a bit carefully when slightly outside what they perceive to be norms. On Oct 18, 2013, at 7:21 PM, Brent Ryan brent.r...@gmail.com wrote: So I think the issue might be related to the tech stack we're using which is SOLR within DataStax enterprise which doesn't support atomic updates. But I think it must have some sort of bug around this because it doesn't appear to work correctly for this use case when using solrj ... Anyways, I've contacted support so lets see what they say. On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey s...@elyograg.org wrote: On 10/18/2013 3:36 PM, Brent Ryan wrote: My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it? Thanks, Shawn
Re: SOLRJ replace document
By all means please do file a support request with DataStax, either as an official support ticket or as a question on StackOverflow. But, I do think the previous answer of avoiding the use of a Map object in your document is likely to be the solution. -- Jack Krupansky -Original Message- From: Brent Ryan Sent: Friday, October 18, 2013 10:21 PM To: solr-user@lucene.apache.org Subject: Re: SOLRJ replace document So I think the issue might be related to the tech stack we're using which is SOLR within DataStax enterprise which doesn't support atomic updates. But I think it must have some sort of bug around this because it doesn't appear to work correctly for this use case when using solrj ... Anyways, I've contacted support so lets see what they say. On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey s...@elyograg.org wrote: On 10/18/2013 3:36 PM, Brent Ryan wrote: My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it? Thanks, Shawn