Re: email datasource connect timeout issue

2013-12-19 Thread xie kidd
For the ideal, never give up, fighting!


On Thu, Dec 19, 2013 at 10:30 AM, xie kidd  wrote:

> Hi all,
>
> When I try to set up an email data source as described at
> http://wiki.apache.org/solr/MailEntityProcessor , a connect timeout
> exception occurs.  I am sure the user and password are correct, and the
> RSS data source also works well. Can anyone do me a favor?
>
> This issue occurs on Solr 4.5 with Tomcat 7; the exception information
> follows:
> --
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Connection 
> failed Processing Document # 1
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)
>   at 
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
>   at 
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
>   at 
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:457)
> Caused by: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Connection 
> failed Processing Document # 1
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:410)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
>   ... 3 more
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
> Connection failed Processing Document # 1
>   at 
> org.apache.solr.handler.dataimport.MailEntityProcessor.connectToMailBox(MailEntityProcessor.java:271)
>   at 
> org.apache.solr.handler.dataimport.MailEntityProcessor.getNextMail(MailEntityProcessor.java:121)
>   at 
> org.apache.solr.handler.dataimport.MailEntityProcessor.nextRow(MailEntityProcessor.java:112)
>   at 
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:469)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
>   ... 5 more
> Caused by: javax.mail.MessagingException: Connection timed out;
>   nested exception is:
>   java.net.ConnectException: Connection timed out
>   at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:571)
>   at javax.mail.Service.connect(Service.java:288)
>   at javax.mail.Service.connect(Service.java:169)
>   at 
> org.apache.solr.handler.dataimport.MailEntityProcessor.connectToMailBox(MailEntityProcessor.java:267)
>   ... 10 more
> Caused by: java.net.ConnectException: Connection timed out
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
>   at java.net.Socket.connect(Socket.java:542)
>   at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:570)
>   at 
> sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:160)
>   at com.sun.mail.util.SocketFetcher.createSocket(SocketFetcher.java:233)
>   at com.sun.mail.util.SocketFetcher.getSocket(SocketFetcher.java:189)
>   at com.sun.mail.iap.Protocol.<init>(Protocol.java:107)
>   at com.sun.mail.imap.protocol.IMAPProtocol.<init>(IMAPProtocol.java:104)
>   at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:538)
>   ... 13 more
>
> --
>
> Thanks in advance.
>
> Thanks,
> Kidd
>
>
> For the ideal, never give up, fighting!
>
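[A quick way to rule out Solr itself is to test the IMAP connection with
plain JavaMail, which is the library MailEntityProcessor uses per the trace
above. A minimal sketch; host, user, and password are placeholders, and it
should be run from the same machine as Solr:]

import java.util.Properties;
import javax.mail.Session;
import javax.mail.Store;

public class ImapConnectTest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Use "imap" instead of "imaps" if the server is not SSL.
        props.setProperty("mail.store.protocol", "imaps");
        Session session = Session.getInstance(props);
        Store store = session.getStore("imaps");
        // Placeholders: replace with your server and credentials.
        store.connect("imap.example.com", "user@example.com", "password");
        System.out.println("Connected: " + store.isConnected());
        store.close();
    }
}

If this also times out, the problem is network-level (firewall, proxy, or
wrong host/port) rather than the Solr configuration.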


Cross referencing of solr documents

2013-12-19 Thread neerajp
Hello All,
I have a problem as described below and would like to have your opinion:

I have multiple documents with the same unique id (by setting the overwrite
field to false). Let's say I have three documents (Doc1, Doc2, Doc3) that all
have the same unique id. A search can match any of the three documents, but I
want the result to always return Doc1. So whether the search matches Doc1,
Doc2 or Doc3, the result should always return Doc1. Please suggest how I can
achieve this.
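[One way to get that behavior is result grouping (field collapsing) on the
shared id, with a group sort that puts Doc1 first. A SolrJ sketch, untested
against this schema; the "priority" field is an assumption — something you
would index on each document to mark Doc1 as the preferred one:]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("your search terms");
        query.set("group", true);                 // collapse results...
        query.set("group.field", "id");           // ...on the shared unique id
        query.set("group.sort", "priority asc");  // hypothetical field: Doc1 gets the lowest value
        QueryResponse rsp = solr.query(query);
        System.out.println(rsp.getGroupResponse().getValues());
    }
}

Since group.limit defaults to 1, each group then returns only its head
document, i.e. Doc1.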



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Cross-referencing-of-solr-documents-tp4107539.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Single multilingual field analyzed based on other field values

2013-12-19 Thread Trey Grainger
Hi Dave,

Sorry for the delayed reply.  Did you end up trying the (scary) caching
idea?

Yeah, there's no reasonable way today to access data from other fields from
the document in the analyzers.  Creating an update request processor which
pulls the data prior to the field-by-field analysis and injects it (in some
format) into the field that needs the data pulled from other fields is how
to do this today.

In my examples, I only inserted a prefix prior to the entire field (i.e.
en,es|hables espanol is what she asks), but if you need something more
complicated to identify specific sections of the field to use different
analyzers then you could pull that off, as well.  For example:
[langs="en"]hello world
[langs="en,es"]hables espanol is what she asks.
[autodetectOtherLangs="true" fallbackLangs="en"]some unknown language text
for identification

Then, you would just have the analyzer for the field parse the content,
pass each chunk of text into the appropriate analyzer, and then modify the
term positions and offsets as necessary.  My example in chapter 14 of Solr
in Action assumed you would be using the same languages throughout the
whole field, but it would just require a little bit of pre-parsing work to
direct the use of specific analyzers only for specific parts of the content.
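[For illustration, a minimal sketch of that pre-parsing step; the bracket
syntax follows the [langs="..."] examples above, and the class and method
names are made up for the sketch:]

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LangChunkParser {
    /** One chunk of field text plus the languages to analyze it with. */
    public static class Chunk {
        public final List<String> langs;
        public final String text;
        Chunk(List<String> langs, String text) { this.langs = langs; this.text = text; }
    }

    // Handles only the [langs="..."] form; the autodetectOtherLangs form
    // from the example above would need another branch.
    private static final Pattern PREFIX =
        Pattern.compile("\\[langs=\"([^\"]*)\"[^\\]]*\\]");

    /** Splits e.g. [langs="en"]hello world[langs="en,es"]hables espanol into chunks. */
    public static List<Chunk> parse(String fieldValue) {
        List<Chunk> chunks = new ArrayList<Chunk>();
        Matcher m = PREFIX.matcher(fieldValue);
        int textStart = -1;
        List<String> langs = null;
        while (m.find()) {
            if (textStart >= 0) {
                chunks.add(new Chunk(langs, fieldValue.substring(textStart, m.start())));
            }
            langs = Arrays.asList(m.group(1).split(","));
            textStart = m.end();
        }
        if (textStart >= 0) {
            chunks.add(new Chunk(langs, fieldValue.substring(textStart)));
        }
        return chunks;
    }
}

Each Chunk would then be fed to the analyzer chain for its languages, with
term positions and offsets adjusted as described.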

Frankly, I'm not sure pulling the data from another field (particularly if
you want different sections processed with different languages) is going to
be much simpler than putting it all into the field to be analyzed to begin
with (or better yet having an update request processor do it for you -
including the detection of language boundaries - inside of Solr so the
customer doesn't have to worry about it).

-Trey


On Tue, Oct 29, 2013 at 12:18 PM, davetroiano wrote:

> Hi Trey,
>
> I was reading v9 of the Solr in Action MEAP but browsing your github repo,
> so I think I'm looking at the latest stuff.
>
> Agreed that the thread caching idea is dangerous.  Perhaps it would work
> now, but it could easily break in a later version of Solr.
>
> I didn't mention another reason why I'd like to analyze based on other
> field
> values, which is that I'd like the ability to run analyzers on sub-sections
> of the MultiTextField.  e.g., given a multilingual document, run my
> text_english analyzer on the first half of a document and my text_french
> analyzer on the second half.  Of course, I could extend the prepend
> approach to take start and end offsets (e.g., <field
> name="myField">[en_0_1000,fr_1001_2500|]blah, blah, ...</field>), but if it
> were possible I'd rather grab that data from another field and simplify the
> tokenizer (in terms of the string manipulation and having to adjust
> position offsets to ignore the prepended data... though you've already done
> the tricky part).
>
> Based on what I'm seeing on the message boards and JIRA (e.g., SOLR-1536 /
> SOLR-1327 not being fixed), it seems like there isn't a clean way to run
> analyzers dynamically based on data in other field(s).  If I end up trying
> the caching idea, I'll report my findings here.
>
> Thanks,
> Dave
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Single-multilingual-field-analyzed-based-on-other-field-values-tp4098141p4098242.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


SOLR 4.6 and getPayloadsForQuery

2013-12-19 Thread Puneet Pawaia
Hi all
Is there any way to get the payloads for a query in Solr?
Lucene has a class PayloadSpanUtil that has a method called
getPayloadsForQuery, which gets the payloads for terms that match. Is there
something similar in Solr?
TIA
Puneet
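[There is no stock Solr request handler for this, but since Solr runs on
Lucene you can call PayloadSpanUtil yourself from a custom SearchComponent.
A minimal sketch of the relevant part, using Lucene/Solr 4.x signatures; the
class name is made up and error handling is omitted:]

import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.PayloadSpanUtil;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.SolrIndexSearcher;

public class PayloadComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) { /* no-op */ }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        SolrIndexSearcher searcher = rb.req.getSearcher();
        Query query = rb.getQuery(); // the parsed user query
        PayloadSpanUtil psu = new PayloadSpanUtil(searcher.getTopReaderContext());
        Collection<byte[]> payloads = psu.getPayloadsForQuery(query);
        rb.rsp.add("payloadCount", payloads.size());
    }

    @Override
    public String getDescription() { return "payload demo"; }

    @Override
    public String getSource() { return ""; }
}

The component still needs to be registered in solrconfig.xml, and the payload
bytes decoded according to how they were encoded at index time.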


Re: LFU cache and autowarming

2013-12-19 Thread Patrick O'Lone
Well, I haven't tested it - if it's not ready yet I will probably avoid it
for now.

> On 12/19/2013 1:46 PM, Patrick O'Lone wrote:
>> If I was to use the LFU cache instead of FastLRU on the filter cache, if
>> I enable auto-warming on that cache type - does it warm the most
>> frequently used fq on the filter cache? Thanks for any info!
> 
> I wrote that cache.  It's a really, really crappy implementation; I would
> only expect it to work well if the cache is very, very small.
> 
> I do have a replacement implementation that's just about ready, but I've
> not been able to find 'round tuits to work on getting it polished and
> committed.
> 
> https://issues.apache.org/jira/browse/SOLR-2906
> https://issues.apache.org/jira/browse/SOLR-3393
> 
> Thanks,
> Shawn
> 
> 


-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone  309-743-0809
Fax .. 309-743-0830


Re: LFU cache and autowarming

2013-12-19 Thread Shawn Heisey
On 12/19/2013 1:46 PM, Patrick O'Lone wrote:
> If I was to use the LFU cache instead of FastLRU on the filter cache, if
> I enable auto-warming on that cache type - does it warm the most
> frequently used fq on the filter cache? Thanks for any info!

I wrote that cache.  It's a really, really crappy implementation; I would
only expect it to work well if the cache is very, very small.

I do have a replacement implementation that's just about ready, but I've
not been able to find 'round tuits to work on getting it polished and
committed.

https://issues.apache.org/jira/browse/SOLR-2906
https://issues.apache.org/jira/browse/SOLR-3393

Thanks,
Shawn



LFU cache and autowarming

2013-12-19 Thread Patrick O'Lone
If I was to use the LFU cache instead of FastLRU on the filter cache, if
I enable auto-warming on that cache type - does it warm the most
frequently used fq on the filter cache? Thanks for any info!

-- 
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@townnews.com
Phone  309-743-0809
Fax .. 309-743-0830


Re: Configurable collectors for custom ranking

2013-12-19 Thread Peter Keegan
I implemented the PostFilter approach described by Joel. Just iterating
over the OpenBitSet, even without the scaling or the HashMap lookup, added
30ms to a query time, which kinda surprised me. There were about 150K hits
out of a total of 500K. Is OpenBitSet the best way to do this?

Thanks,
Peter
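[For reference, the standard way to walk an OpenBitSet in Lucene 4.x is the
nextSetBit loop; a self-contained sketch:]

import org.apache.lucene.util.OpenBitSet;

public class BitsetWalk {
    public static void main(String[] args) {
        OpenBitSet hits = new OpenBitSet(500000);
        hits.set(3); hits.set(42); hits.set(150000); // stand-ins for collected docids

        // Visit every set bit in ascending order.
        for (int doc = hits.nextSetBit(0); doc != -1; doc = hits.nextSetBit(doc + 1)) {
            // per-document work goes here
        }
    }
}

If that loop alone costs 30ms for 150K set bits, the time may be going into
the per-document work (or into delegating to the lower collectors) rather
than into the bitset iteration itself.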


On Thu, Dec 19, 2013 at 9:51 AM, Peter Keegan wrote:

> In order to size the PriorityQueue, the result window size for the query
> is needed. This has been computed in the SolrIndexSearcher and available
> in: QueryCommand.getSupersetMaxDoc(), but doesn't seem to be available for
> the PostFilter in either the SolrParms or SolrQueryRequest. Is there a way
> to get this precomputed value or do I have to duplicate the logic from
> SolrIndexSearcher?
>
> Thanks,
> Peter
>
>
> On Thu, Dec 12, 2013 at 1:53 PM, Joel Bernstein wrote:
>
>> Thanks, I agree this powerful stuff. One of the reasons that I haven't
>> gotten back to pluggable collectors is that I've been using PostFilters
>> instead.
>>
>> When you start doing stuff with scores in postfilters you'll run into the
> >> bug in SOLR-5416. This will affect you when you use facets in combination
>> with the QueryResultCache or tag and exclude faceting.
>>
>> The patch in SOLR-5416 resolves this issue. You'll just need your
>> PostFilter to implement ScoreFilter and the SolrIndexSearcher will know
>> how
>> to handle things.
>>
>> The DelegatingCollector.finish() method is so new, these kinds of bugs are
>> still being cleaned out of the system. SOLR-5416 should be in Solr 4.7.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Dec 12, 2013 at 12:54 PM, Peter Keegan > >wrote:
>>
>> > This is pretty cool, and worthy of adding to Solr in Action (v2) and the
>> > other books. With function queries, flexible filter processing and
>> caching,
>> > custom collectors, and post filters, there's a lot of flexibility here.
>> >
>> > Btw, the query times using a custom collector to scale/recompute scores
>> is
>> > excellent (will have to see how it compares to your outlined solution).
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> > On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein 
>> > wrote:
>> >
>> > > The sorting is going to happen in the lower level collectors. You
>> need a
>> > > value source that returns the score of the document being collected.
>> > >
>> > > Here is how you can make this happen:
>> > >
>> > > 1) Create an object in your PostFilter that simply holds the current
>> > score.
>> > > Place this object in the SearchRequest context map. Update
>> object.score
>> > as
>> > > you pass the docs and scores to the lower collectors.
>> > >
>> > > 2) Create a values source that checks the SearchRequest context for
>> the
>> > > object that's holding the current score. Use this object to return the
>> > > current score when called. For example if you give the value source a
>> > > handle called "score" a compound function call will look like this:
>> > > sum(score(), field(x))
>> > >
>> > > Joel
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan > > > >wrote:
>> > >
>> > > > Regarding my original goal, which is to perform a math function
>> using
>> > the
>> > > > scaled score and a field value, and sort on the result, how does
>> this
>> > fit
>> > > > in? Must I implement another custom PostFilter with a higher cost
>> than
>> > > the
>> > > > scale PostFilter?
>> > > >
>> > > > Thanks,
>> > > > Peter
>> > > >
>> > > >
>> > > > On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan <
>> peterlkee...@gmail.com
>> > > > >wrote:
>> > > >
>> > > > > Thanks very much for the guidance. I'd be happy to donate a
>> working
>> > > > > solution.
>> > > > >
>> > > > > Peter
>> > > > >
>> > > > >
>> > > > > On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein <
>> joels...@gmail.com
>> > > > >wrote:
>> > > > >
>> > > > >> SOLR-5020 has the commit info, it's mainly changes to
>> > > SolrIndexSearcher
>> > > > I
>> > > > >> believe. They might apply to 4.3.
>> > > > >> I think as long you have the finish method that's all you'll
>> need.
>> > If
>> > > > you
>> > > > >> can get this working it would be excellent if you could donate
>> back
>> > > the
>> > > > >> Scale PostFilter.
>> > > > >>
>> > > > >>
>> > > > >> On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan <
>> > peterlkee...@gmail.com
>> > > > >> >wrote:
>> > > > >>
>> > > > >> > This is what I was looking for, but the DelegatingCollector
>> > 'finish'
>> > > > >> method
>> > > > >> > doesn't exist in 4.3.0 :(   Can this be patched in and are
>> there
>> > any
>> > > > >> other
>> > > > >> > PostFilter dependencies on 4.5?
>> > > > >> >
>> > > > >> > Thanks,
>> > > > >> > Peter
>> > > > >> >
>> > > > >> >
>> > > > >> > On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein <
>> > joels...@gmail.com
>> > > >
>> > > > >> > wrote:
>> > > > >> >
>> > > > >> > > Here is one approach to use in a postfilter
>> > > > >> > >
>> > > > >> > > 1) In the collect() method call score for each doc. Use t

Re: Solr hanging when extracting a some broken .doc files

2013-12-19 Thread Augusto Camarotti
Hey Andrea! Thanks for answering; the complete stack trace follows
below (the other one is just the same).
I'm going to try that change to the logging level, but I'm really
considering debugging Tika and trying to fix it myself.
 
 

03:38:23 ERROR SolrCore org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@386f9474
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@386f9474
 at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
 at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
 at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
 at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
 at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368)
 at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
 at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
 at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:647)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.microsoft.OfficeParser@386f9474
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
 ... 32 more
Caused by: java.lang.IllegalStateException: Told we're for characters 122 -> 
978, but actually covers 855 characters!
 at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73)
 at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:111)
 at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
 at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:72)
 at 
org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:462)
 at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:81)
 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 35 more
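[To reproduce this outside Solr, and to test a Tika/POI fix directly, a
standalone driver like the following is enough; a sketch, with the file path
as a placeholder:]

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaRepro {
    public static void main(String[] args) throws Exception {
        // Placeholder path: point this at one of the broken .doc files.
        InputStream in = new FileInputStream("/path/to/broken.doc");
        try {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            parser.parse(in, handler, new Metadata(), new ParseContext());
            System.out.println(handler.toString().length() + " chars extracted");
        } finally {
            in.close();
        }
    }
}

Run it against the same Tika/POI jars that ship with your Solr version; the
IllegalStateException above should then be reproducible in isolation.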


>>> Andrea Gazzarini  17/12/2013 16:43 >>>
Hi Augusto,
I don't believe the mailing list allows attachments. Could you please post
the complete stacktrace? In addition, set the logging level of tika classes
to FINEST in sol

Re: Solr cloud (4.6.0) instances going down

2013-12-19 Thread Yago Riveiro
I have had a lot of problems with the stability of my cloud.

To improve the stability:

- Move zookeeper to another disk; the I/O from solr.home can kill your ensemble.

- Raise the zkClientTimeout to 60s.

- Don't use a very big heap if you don't need it; try values around 4g and
increase until OOMs no longer happen.

- Use the recommendations for tuning the heap from
http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning; 99% of my problems with
zookeeper were fixed.

- Log GC times; I discovered pauses of 32s on my boxes, a total killer for
zookeeper. The result: tons of expired sessions.
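[For the GC-time logging, the standard HotSpot flags are enough, e.g.
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/solr_gc.log
added to the Solr startup line; the log path is a placeholder.]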


-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Thursday, December 19, 2013 at 5:45 PM, Shawn Heisey wrote:

> On 12/19/2013 3:44 AM, ilay raja wrote:
> > I have deployed solr cloud with external zookeeper ensemble (5
> > instances). I am running solr instances on two servers with single shard
> > index. There are 6 replicas. I often see solr going down during high search
> > load (or) whenever i run indexing documents. I tried tuning hardcommit
> > (kept as 15 mins) and softcommits(12 mins). Also, set zkClientTimeout as 30
> > secs. I observed sometimes OOM, Socket exceptions., EOF exceptions in solr
> > logs while the instance is going down. Also, zookeeper recovery for the
> > solr instance is going in loop  My use case is sort of high search (100
> > queries per sec) / heavy indexing (10 K docs per minute). What is the best
> > way to keep stable solr cloud isntances with external ensemble. Should we
> > try running zookeeper internally, because looks like zookeeper handshaking
> > might be an issue as well. Is solr cloud stable for production ? or there
> > are open issues still. Please guide me.
> > 
> 
> 
> You definitely do not want to run zookeeper embedded in Solr. The
> simple reason for this is simply because if you stop Solr, you also stop
> zookeeper. Zookeeper works best if it remains up all the time, so an
> external ensemble is highly recommended.
> 
> It's probably a good idea to set the max heap on the zookeeper startup
> ... one of my zk java instances is using 65MB resident memory, so unless
> it's a very large cloud, a low number like 128MB would probably be enough.
> 
> I've heard that heavy I/O on the disk with the zookeeper data can cause
> problems for zookeeper. This is the one danger that can come from
> putting both Solr and an external zookeeper on the same host, which is
> usually a very safe thing to do. Unless you've got very fast I/O, it's
> recommended that the zookeeper data is put on separate disk spindles
> from anything else. When Solr has performance problems, it's usually
> from heavy I/O, and if heavy I/O is causing problems with zookeeper,
> then the problem just compounds itself.
> 
> You haven't indicated how big the java heap for Solr is. Severe
> stability problems can result from GC pauses, so it's extremely
> important to tune your garbage collection unless your Solr max heap is
> very very small (less than 1GB). Here's my personal wiki page with
> settings that work for me, they seem to work for others too:
> 
> http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
> 
> Severe GC pause problems can also result from the Solr java heap being
> too small. Here's a more involved wiki page on performance issues that
> I have seen:
> 
> http://wiki.apache.org/solr/SolrPerformanceProblems
> 
> Thanks,
> Shawn
> 
> 




Re: Solr could replace shards

2013-12-19 Thread Michael Della Bitta
I would make one *collection* for each date range and then make a
collection alias or aliases that span the ones that you want to query.

http://wiki.apache.org/solr/SolrCloud#Collection_Aliases

I don't have a good idea for you for how to handle indexing off-cluster,
however.
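For example (host and collection names are placeholders), an alias spanning
two date-ranged collections can be created with the Collections API:

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=recent&collections=coll_2013_11,coll_2013_12

Re-issuing CREATEALIAS with the same name repoints the alias, so rolling it
forward as new collections are added is a single call.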

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Wed, Dec 18, 2013 at 4:45 PM, Max Hansmire  wrote:

> I am considering using SolrCloud, but I have a use case that I am not sure
> if it covers.
>
> I would like to keep an index up to date in realtime, but also I would like
> to sometimes restate the past. The way that I would restate the past is to
> do batch processing over historical data.
>
> My idea is that I would have the Solr collection sharded by date range. As
> I move forward in time I would add more shards.
>
> For restating historical data I would have a separate process that actually
> indexes a shard's worth of data. (This keeps the servers that are meant for
> production search from having to handle the load of indexing historically.)
> I would then move the index files to the solr servers and register the
> newly created index with the server replacing the existing shards.
>
> I used to be able to do something similar pre-SolrCloud by using the core
> admin. But this did not have the benefit of having one search for the
> entire "collection". I had to manually query each of the cores to get the
> full search index.
>
> Essentially the question is:
> 1- is it possible to shard by date range in this way?
> 2- is it possible to swap out the index used by a shard?
> 3- is there a different way I should be thinking of this?
>
> Max
>


Re: Solr cloud (4.6.0) instances going down

2013-12-19 Thread Shawn Heisey
On 12/19/2013 3:44 AM, ilay raja wrote:
>   I have deployed solr cloud with external zookeeper ensemble (5
> instances). I am running solr instances on two servers with single shard
> index. There are 6 replicas. I often see solr going down during high search
> load (or) whenever i run indexing documents. I tried tuning hardcommit
> (kept as 15 mins) and softcommits(12 mins). Also, set zkClientTimeout as 30
> secs. I observed sometimes OOM, Socket exceptions., EOF exceptions in solr
> logs while the instance is going down. Also, zookeeper recovery for the
> solr instance is going in loop  My use case is sort of high search (100
> queries per sec) / heavy indexing (10 K docs per minute). What is the best
> way to keep stable solr cloud instances with an external ensemble. Should we
> try running zookeeper internally, because looks like zookeeper handshaking
> might be an issue as well. Is solr cloud stable for production ? or there
> are open issues still. Please guide me.

You definitely do not want to run zookeeper embedded in Solr.  The
simple reason for this is simply because if you stop Solr, you also stop
zookeeper.  Zookeeper works best if it remains up all the time, so an
external ensemble is highly recommended.

It's probably a good idea to set the max heap on the zookeeper startup
... one of my zk java instances is using 65MB resident memory, so unless
it's a very large cloud, a low number like 128MB would probably be enough.

I've heard that heavy I/O on the disk with the zookeeper data can cause
problems for zookeeper.  This is the one danger that can come from
putting both Solr and an external zookeeper on the same host, which is
usually a very safe thing to do.  Unless you've got very fast I/O, it's
recommended that the zookeeper data is put on separate disk spindles
from anything else.  When Solr has performance problems, it's usually
from heavy I/O, and if heavy I/O is causing problems with zookeeper,
then the problem just compounds itself.

You haven't indicated how big the java heap for Solr is.  Severe
stability problems can result from GC pauses, so it's extremely
important to tune your garbage collection unless your Solr max heap is
very very small (less than 1GB).  Here's my personal wiki page with
settings that work for me, they seem to work for others too:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Severe GC pause problems can also result from the Solr java heap being
too small.  Here's a more involved wiki page on performance issues that
I have seen:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-19 Thread Daniel Collins
Are you using an NRT solution, and how often do you commit?  We see similar
issues with PeerSync, but then we have a very active NRT system and we
soft-commit sub-second, so since PeerSync has a limit of 100 versions
before it decides it's too much to do, if we try to PeerSync whilst
indexing is running, we inevitably have to fall back to a full sync as this
does.

What Solr version are you using?  There were issues with early 4.X (up to
4.3, wasn't it? I can't find the ticket now) whereby if PeerSync failed,
it did a fullCopy when what it should do is an incremental copy (i.e. only
the segments that are missing), which again might hurt you.

But surely your real problem isn't that the recovery takes a long time;
your problem is why the system entered recovery in the first place (which
is the bit *just before* the trace you gave us!)
The first line is "It has been requested that we recover" (aren't Solr
trace messages polite), which I know from recent experience means either the
leader thinks this replica is out of date or you just had a leadership
transfer.  As Mark says, the root cause of that is probably a ZK timeout
issue.
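[For reference, the timeout Mark mentions is the zkClientTimeout the nodes
are started with; with the stock example solr.xml, which reads
${zkClientTimeout:15000}, raising it is a startup change along the lines of
java -DzkClientTimeout=60000 ... .]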



On 19 December 2013 15:50, Mark Miller  wrote:

> Sounds like you need to raise your ZooKeeper connection timeout.
>
> Also, make sure you are using a concurrent garbage collector as a side
> note - stop the world pauses should be avoided. Just good advice :)
>
> - Mark
>
> On Dec 18, 2013, at 5:48 AM, Anca Kopetz  wrote:
>
> > Hi,
> >
> > In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from
> time to time into recovering state, and it takes more than 10 minutes to
> finish to recover.
> >
> > In logs, we see that "PeerSync Recovery" fails with the message :
> >
> > PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many
> updates received since start - startingUpdates no longer overlaps with our
> currentUpdates
> >
> > Then "Replication Recovery" starts.
> >
> > Is there something we can do to avoid the failure of "Peer Recovery" so
> that the recovery process is more rapid (less than 10 minutes) ?
> >
> > The full trace log is here :
> >
> > 2013-12-05 13:51:53,740 [http-8080-46] INFO
>  
> org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>  - It has been requested that we recover
> > 2013-12-05 13:51:53,740 [http-8080-112] INFO
>  
> org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>  - It has been requested that we recover
> > 2013-12-05 13:51:53,740 [http-8080-112] INFO
>  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
> [admin] webapp=null path=/admin/cores
> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
> QTime=0
> > 2013-12-05 13:51:53,740 [Thread-1544] INFO
>  org.apache.solr.cloud.ZkController:publish:1017  - publishing
> core=fr_green state=recovering
> > 2013-12-05 13:51:53,741 [http-8080-46] INFO
>  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
> [admin] webapp=null path=/admin/cores
> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
> QTime=1
> > 2013-12-05 13:51:53,740 [Thread-1543] INFO
>  org.apache.solr.cloud.ZkController:publish:1017  - publishing
> core=fr_green state=recovering
> > 2013-12-05 13:51:53,743 [Thread-1544] INFO
>  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
> descriptor - reading it from system property
> > 2013-12-05 13:51:53,746 [Thread-1543] INFO
>  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
> descriptor - reading it from system property
> > 2013-12-05 13:51:53,755 [Thread-1543] WARN
>  org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for
> zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
> > 2013-12-05 13:51:53,756 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
> process.  core=fr_green recoveringAfterStartup=false
> > 2013-12-05 13:51:53,762 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
> process. core=fr_green
> > 2013-12-05 13:51:53,762 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
> process.  core=fr_green recoveringAfterStartup=false
> > 2013-12-05 13:51:53,765 [RecoveryThread] INFO
>  org.apache.solr.cloud.ZkController:publish:1017  - publishing
> core=fr_green state=recovering
> > 2013-12-05 13:51:53,765 [RecoveryThread] INFO
>  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
> descriptor - reading it from system property
> > 2013-12-05 13:51:53,767 [RecoveryThread] INFO
>  org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
> Creating new http client,
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > 2013-12-05 13:51:54,777 [main-EventThread] INFO
>  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
> change: WatchedEvent s

Re: Bad fieldNorm when using morphologic synonyms

2013-12-19 Thread Isaac Hebsh
Roman, do you have any results?

created SOLR-5561

Robert, if I'm wrong, you are welcome to close that issue.


On Mon, Dec 9, 2013 at 10:50 PM, Isaac Hebsh  wrote:

> You can see the norm value, in the "explain" text, when setting
> debugQuery=true.
> If the same item gets different norm before/after, that's it.
>
> Note that this configuration is in schema.xml (not solrconfig.xml...)
>
> On Monday, December 9, 2013, Roman Chyla wrote:
>
>> Isaac, is there an easy way to recognize this problem? We also index
>> synonym tokens in the same position (like you do, and I'm sure that our
>> positions are set correctly). I could test whether the default similarity
>> factory in solrconfig.xml had any effect (before/after reindexing).
>>
>> --roman
>>
>>
>> On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh 
>> wrote:
>>
>> > Hi Robert and Manuel.
>> >
>> > The DefaultSimilarity indeed sets discountOverlap to true by default.
>> > BUT, the *factory*, aka DefaultSimilarityFactory, when called by
>> > IndexSchema (the getSimilarity method), explicitly sets this value to
>> the
>> > value of its corresponding class member.
>> > This class member is initialized to be FALSE  when the instance is
>> created
>> > (like every boolean variable in the world). It should be set when "init"
>> > method is called. If the parameter is not set in schema.xml, the
>> default is
>> > true.
>> >
>> > Everything seems to be alright, but the issue is that "init" method is
>> NOT
>> > called, if the similarity is not *explicitly* declared in schema.xml. In
>> > that case, init method is not called, the discountOverlaps member (of
>> the
>> > factory class) remains FALSE, and getSimilarity explicitly calls
>> > setDiscountOverlaps with value of FALSE.
>> >
>> > This is very easy to reproduce and debug.
>> >
>> >
>> > On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir  wrote:
>> >
>> > > no, its turned on by default in the default similarity.
>> > >
>> > > as i said, all that is necessary is to fix your analyzer to emit the
>> > > proper position increments.
>> > >
>> > > On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
>> > >  wrote:
>> > > > In order to set discountOverlaps to true you must have added the
>> > > > <similarity class="solr.DefaultSimilarityFactory"/> element to the
>> > > > schema.xml, which is commented out by default!
>> > > >
>> > > > As by default this param is false, the above situation is expected
>> with
>> > > > correct positioning, as said.
>> > > >
>> > > > In order to fix the field norms you'd have to reindex with the
>> > similarity
>> > > > class which initializes the param to true.
>> > > >
>> > > > Cheers,
>> > > > Manu
>> > >
>> >
>>
>


Re: LocalParam for nested query without escaping?

2013-12-19 Thread Isaac Hebsh
created SOLR-5560


On Tue, Dec 10, 2013 at 8:48 AM, William Bell  wrote:

> Sounds like a bug.
>
>
> On Mon, Dec 9, 2013 at 1:16 PM, Isaac Hebsh  wrote:
>
> > If so, can someone suggest how a query should be escaped (securely and
> > correctly)?
> > Should I escape the quote mark (and backslash mark itself) only?
> >
> >
> > On Fri, Dec 6, 2013 at 2:59 PM, Isaac Hebsh 
> wrote:
> >
> > > Obviously, there is the option of external parameter ({...
> > > v=$nestedq}&nestedq=...)
> > >
> > > This is a good solution, but it is not practical, when having a lot of
> > > such nested queries.
> > >
> > > Any ideas?
> > >
> > > On Friday, December 6, 2013, Isaac Hebsh wrote:
> > >
> > >> We want to set a LocalParam on a nested query. When querying with "v"
> > >> inline parameter, it works fine:
> > >>
> > >>
> >
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text
>  v="TERM2 TERM3 \"TERM4 TERM5\""}
> > >>
> > >> the parsedquery_toString is
> > >> +id:TERM1 +(text:term2 text:term3 text:"term4 term5")
> > >>
> > >> Query using the "_query_" also works fine:
> > >>
> > >>
> >
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND _query_:"{!lucene
> df=text}TERM2 TERM3 \"TERM4 TERM5\""
> > >>
> > >> (parsedquery is exactly the same).
> > >>
> > >>
> > >> BUT, when trying to put the nested query in place, it yields syntax
> > error:
> > >>
> > >>
> >
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text}(TERM2
>  TERM3 "TERM4 TERM5")
> > >>
> > >> org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'
> > >>
> > >> The previous options are less preferred because of the escaping that
> > >> must be applied to the nested query.
> > >>
> > >> Can't I set a LocalParam to a nested query without escaping the query?
> > >>
> > >
> >
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>


Re: Solr cloud (4.6.0) instances going down

2013-12-19 Thread ilay raja
On Thu, Dec 19, 2013 at 4:14 PM, ilay raja  wrote:

> Hi,
>
>   I have deployed solr cloud with external zookeeper ensemble (5
> instances). I am running solr instances on two servers with single shard
> index. There are 6 replicas. I often see solr going down during high search
> load (or) whenever i run indexing documents. I tried tuning hardcommit
> (kept as 15 mins) and softcommits(12 mins). Also, set zkClientTimeout as 30
> secs. I observed sometimes OOM, Socket exceptions., EOF exceptions in solr
> logs while the instance is going down. Also, zookeeper recovery for the
> solr instance is going in loop  My use case is sort of high search (100
> queries per sec) / heavy indexing (10 K docs per minute). What is the best
> way to keep stable solr cloud instances with an external ensemble. Should we
> try running zookeeper internally, because looks like zookeeper handshaking
> might be an issue as well. Is solr cloud stable for production ? or there
> are open issues still. Please guide me.
>


Re: Not able to query strings ending with special characters.

2013-12-19 Thread Jack Krupansky
That's a feature of the standard tokenizer. You'll have to use a field type 
which uses the white space tokenizer to preserve special characters.
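[A quick way to see the difference is to run both analyzers over the failing
input; a sketch against the Lucene 4.6 analyzer API:]

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizerDemo {
    static void dump(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term + "] ");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "PEOPLES'";
        dump(new StandardAnalyzer(Version.LUCENE_46), text);   // prints [peoples]  - apostrophe dropped
        dump(new WhitespaceAnalyzer(Version.LUCENE_46), text); // prints [PEOPLES'] - preserved
    }
}

The same comparison can be done without code on the Analysis page of the
Solr admin UI.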


-- Jack Krupansky

-Original Message- 
From: suren

Sent: Thursday, December 19, 2013 10:56 AM
To: solr-user@lucene.apache.org
Subject: Not able to query strings ending with special characters.

Unable to query strings ending with special characters; Solr skips the
last special character and returns results anyway. I am including the string
in double quotes.

For example i am unable to query strings like "JOHNSON &", "PEOPLES'".
It queries well for "JOHNSON & SONS", "PEOPLES' SELF-HELP"

I tried giving following values in fq field in solr UI.
ORGANIZATION_NAM:"peoples'"
ORGANIZATION_NAM:"peoples\'"


I am also getting same results from solrj.

my schema




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Not-able-to-query-strings-ending-with-special-characters-tp4107471.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Not able to query strings ending with special characters.

2013-12-19 Thread suren
Unable to query strings ending with special characters; Solr skips the
last special character and returns results anyway. I am including the string
in double quotes.

For example i am unable to query strings like "JOHNSON &", "PEOPLES'". 
It queries well for "JOHNSON & SONS", "PEOPLES' SELF-HELP"

I tried giving following values in fq field in solr UI.
ORGANIZATION_NAM:"peoples'"
ORGANIZATION_NAM:"peoples\'"


I am also getting same results from solrj.

my schema




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Not-able-to-query-strings-ending-with-special-characters-tp4107471.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.5 - Solr Cloud is creating new cores on random nodes

2013-12-19 Thread Mark Miller
Sounds pretty weird. I would use 4.5.1. Don’t know that it will address this, 
but it’s a very good idea.

This doesn’t sound like a feature to me. I’d file a JIRA issue if it seems like 
a real problem.

Are you using the old style solr.xml with cores defined in it or the new core 
discovery mode (cores are not defined in solr.xml)?

- Mark

On Dec 18, 2013, at 6:30 PM, Ryan Wilson  wrote:

> Hello all,
> 
> I am currently in the process of building out a solr cloud with solr 4.5 on
> 4 nodes with some pretty hefty hardware. When we create the collection we
> have a replication factor of 2 and store 2 replicas per node.
> 
> While we have been experimenting, which has involved bringing nodes up and
> down as well as tanking them with OOM errors while messing with jvm
> settings, we have observed a disturbing trend where we will bring nodes
> back up and suddenly shard x has 6 replicas spread across the nodes. These
> replicas will have been created with no action on our part and we would
> much rather they not be created at all.
> 
> I have not been able to determine whether this is a bug or a feature. If
> its a bug, I will happily provide what I can to track it down. If it is a
> feature, I would very much like to turn it off.
> 
> Any Information is appreciated.
> 
> Regards,
> Ryan Wilson
> rpwils...@gmail.com



Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-19 Thread Mark Miller
Sounds like you need to raise your ZooKeeper connection timeout. 

Also, make sure you are using a concurrent garbage collector as a side note - 
stop the world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetz  wrote:

> Hi,
> 
> In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to 
> time into recovering state, and it takes more than 10 minutes to finish to 
> recover.
> 
> In logs, we see that "PeerSync Recovery" fails with the message :
> 
> PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates 
> received since start - startingUpdates no longer overlaps with our 
> currentUpdates
> 
> Then "Replication Recovery" starts. 
> 
> Is there something we can do to avoid the failure of "Peer Recovery" so that 
> the recovery process is more rapid (less than 10 minutes) ?
> 
> The full trace log is here : 
> 
> 2013-12-05 13:51:53,740 [http-8080-46] INFO  
> org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>   - It has been requested that we recover
> 2013-12-05 13:51:53,740 [http-8080-112] INFO  
> org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>   - It has been requested that we recover
> 2013-12-05 13:51:53,740 [http-8080-112] INFO  
> org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] 
> webapp=null path=/admin/cores 
> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 
> QTime=0
> 2013-12-05 13:51:53,740 [Thread-1544] INFO  
> org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
> state=recovering
> 2013-12-05 13:51:53,741 [http-8080-46] INFO  
> org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] 
> webapp=null path=/admin/cores 
> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 
> QTime=1
> 2013-12-05 13:51:53,740 [Thread-1543] INFO  
> org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
> state=recovering
> 2013-12-05 13:51:53,743 [Thread-1544] INFO  
> org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
> descriptor - reading it from system property
> 2013-12-05 13:51:53,746 [Thread-1543] INFO  
> org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
> descriptor - reading it from system property
> 2013-12-05 13:51:53,755 [Thread-1543] WARN  
> org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for 
> zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
> 2013-12-05 13:51:53,756 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery process.  
> core=fr_green recoveringAfterStartup=false
> 2013-12-05 13:51:53,762 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery 
> process. core=fr_green
> 2013-12-05 13:51:53,762 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery process.  
> core=fr_green recoveringAfterStartup=false
> 2013-12-05 13:51:53,765 [RecoveryThread] INFO  
> org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
> state=recovering
> 2013-12-05 13:51:53,765 [RecoveryThread] INFO  
> org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
> descriptor - reading it from system property
> 2013-12-05 13:51:53,767 [RecoveryThread] INFO  
> org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating 
> new http client, 
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> 2013-12-05 13:51:54,777 [main-EventThread] INFO  
> org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
> change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> 2013-12-05 13:51:56,804 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to 
> PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - 
> recoveringAfterStartup=false
> 2013-12-05 13:51:56,806 [RecoveryThread] WARN  
> org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green 
> url=http://solr-08/searchsolrnodefr too many updates received since start - 
> startingUpdates no longer overlaps with our currentUpdates
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery 
> was not successful - trying replication. core=fr_green
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting Replication 
> Recovery. core=fr_green
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering 
> updates. core=fr_green
> 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
> org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attemptin

Re: Solr cloud Performance Problems

2013-12-19 Thread Shawn Heisey
On 12/19/2013 2:35 AM, hariprasadh89 wrote:
> We have done the solr cloud setup:
> In one machine
> 1. Centos 6.3
> 2. Apache solr 4.1
> 3. JbossasFinal 7.1.1
> 4 .ZooKeeper 
> Let's set up the zookeeper cloud on 2 machines
> 
> download and untar zookeeper in /opt/zookeeper directory on both servers
> solr1 & solr2. On both the servers do the following
> 
> root@solr1$ mkdir /opt/zookeeper/data
> root@solr1$ cp /opt/zookeeper/conf/zoo_sample.cfg
> /opt/zookeeper/conf/zoo.cfg
> root@solr1$ vim /opt/zookeeper/zoo.cfg
> Make the following changes in the zoo.cfg file
> 
> dataDir=/opt/zookeeper/data
> server.1=solr1:2888:3888
> server.2=solr2:2888:3888
> Save the zoo.cfg file.
> Assign different ids to the zookeeper servers
> 
> on solr1
> 
> root@solr1$ echo 1 > /opt/zookeeper/data/myid
> 
> on solr2
> 
> root@solr2$ echo 2 > /opt/zookeeper/data/myid
> 
> Start zookeeper on both the servers
> 
> root@solr1$ cd /opt/zookeeper
> root@solr1$ ./bin/zkServer.sh start
> Note : in future when you need to reset the cluster/shards information do
> the following
> 
> 4.RAM-2GB
> 5.set the heap size to 1GB
> Extracted the solr.war and change the solr home in web.xml of solr.
> In bin folder of jboss ,the JAVA_OPTS parameter has been set in
> standalone.conf
> java -DzkHost=solr1:2181,solr2:2181 -Dbootstrap_confdir=solr/corename/conf/
> -DnumShards=2
> 
> Restart the jboss
> 
> In another machine
> 
> 1. Centos 6.3
> 2. Apache solr 4.1
> 3. JbossasFinal 7.1.1
> 4.RAM-2GB
> 5.set the heap size to 1GB
> Extracted the solr.war and change the solr home in web.xml of solr.
> In bin folder of jboss ,the JAVA_OPTS parameter has been set in
> standalone.conf
> 
> Restart the jboss
> 
> Everything has been  done properly.
> But it is taking too much time to upload data into solr.
> It is taking more time than uploading data with one solr without shard
> concept
> Able to view two shards in solr cloud option present in ui.
> Please explain how the larger index splits and allocated in two shards.
> 
> Please suggest some optimization techniques.

First problem, but not likely the cause of your complaint -- replicated
zookeeper requires three hosts minimum.  If you only have two, they both
have to be up, or quorum is lost.  Zookeeper requires a majority
[(n/2)+1] of hosts to be active to maintain quorum.  With three or four
hosts, one may be down.  With five or six hosts, two may be down.  You
need to add another host for zookeeper.  It does not need to be a
powerful host, because SolrCloud typically will NOT put much load on
zookeeper.

A 1GB heap is pretty small in the Solr world.  With 2GB total RAM and
1GB heap, if the index size on each server is bigger than about 1.5-2GB,
performance will be terrible.  My production Solr server has over 45GB
of index per server, so I have 64GB of RAM on each server.  6GB of that
goes to Solr's java heap.

http://wiki.apache.org/solr/SolrPerformanceProblems

SolrCloud amplifies existing performance problems because indexing
requests must be forwarded to the leader of a shard, which will forward
it to all replicas of that shard.  If there are no underlying
performance issues, SolrCloud will usually index almost as fast as
standard unsharded Solr.

Also, Solr 4.1.0 is very old at this point, released back in January of
this year.  SolrCloud was new in 4.0, and has been evolving very quickly
through new releases.  Version 4.6.0 is the latest, released on December
2nd.  You can see the release history back to 4.0 here:

http://lucene.apache.org/solr/solrnews.html

One more final thing -- Solr works best in the jetty that's included
with the download in the example directory.  With a more complex
container like Tomcat or JBoss, memory requirements will be elevated a
little bit, which when working in a tight 2GB memory space, can make a
significant difference.

Thanks,
Shawn



Re: [Announce] Apache Solr 4.6 with RankingAlgorithm 1.5.2 available now with complex-lsa algorithm (simulates human language acquisition and recognition)

2013-12-19 Thread Alessandro Benedetti
Hi Nagendra,
really cool topic.
I'm really interested in discovering more about the three
similarity algorithms you offer (Term Similarity, Document Similarity and
Term In Document Similarity).
I was looking for more details and the explanations behind your Ranking
Algorithm.
Where could I start? I'm really curious about the LSA model used and also the
low-level algorithms and science applied.
Good work!

Cheers


2013/12/15 Nagendra Nagarajayya 

> Hi!
>
> I am very excited to announce the availability of Solr 4.6 with
> RankingAlgorithm 1.5.2.
>
> Solr 4.6 with RankingAlgorithm 1.5.2 includes the new algorithm
> complex-lsa. complex-lsa simulates human language acquisition and
> recognition (see demo )
> and can retrieve semantically related/hidden relationships between terms,
> sentences, paragraphs, chapters, books, images, etc. Three new
> similarities, TERM_SIMILARITY, DOCUMENT_SIMILARITY,
> TERM_DOCUMENT_SIMILARITY enable these with improved precision.
>
> Solr 4.6 with RankingAlgorithm 1.5.2 also includes realtime-search with
> multiple granularities. realtime-search is very fast NRT and allows you to
> not only lookup a document by id but also allows you to search in realtime,
> see http://tgels.org/realtime-nrt.jsp. The update performance is about
> 70,000 docs / sec. The query performance is in ms, allows you to  query a
> 10m wikipedia index (complete index) in <50 ms.
>
> RankingAlgorithm 1.5.2 with complex-lsa supports the entire Lucene Query
> Syntax, ± and/or boolean/dismax/glob/regular 
> expression/wildcard/fuzzy/prefix/suffix
> queries with boosting, etc. and is compatible with the new Lucene 4.6 api.
>
> You can get more information about complex-lsa and realtime-search
> performance from here:
> http://solr-ra.tgels.org/wiki/en/Complex-lsa-demo
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x
>
> You can download Solr 4.6 with RankingAlgorithm 1.5.2 from here:
> http://solr-ra.tgels.org
>
> Please download and give the new version a try.
>
> Regards,
>
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://elasticsearch-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> Note:
> 1. Apache Solr 4.6 with RankingAlgorithm 1.5.2 is an external project.
> 2. realtime-search has been contributed back to Apache Solr, see
> https://issues.apache.org/jira/browse/SOLR-3816
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Configurable collectors for custom ranking

2013-12-19 Thread Peter Keegan
In order to size the PriorityQueue, the result window size for the query is
needed. This has been computed in the SolrIndexSearcher and available in:
QueryCommand.getSupersetMaxDoc(), but doesn't seem to be available for the
PostFilter in either the SolrParms or SolrQueryRequest. Is there a way to
get this precomputed value or do I have to duplicate the logic from
SolrIndexSearcher?

Thanks,
Peter


On Thu, Dec 12, 2013 at 1:53 PM, Joel Bernstein  wrote:

> Thanks, I agree this powerful stuff. One of the reasons that I haven't
> gotten back to pluggable collectors is that I've been using PostFilters
> instead.
>
> When you start doing stuff with scores in postfilters you'll run into the
> bug in SOLR-5416. This will affect you when you use facets in combination
> with the QueryResultCache or tag and exclude faceting.
>
> The patch in SOLR-5416 resolves this issue. You'll just need your
> PostFilter to implement ScoreFilter and the SolrIndexSearcher will know how
> to handle things.
>
> The DelegatingCollector.finish() method is so new, these kinds of bugs are
> still being cleaned out of the system. SOLR-5416 should be in Solr 4.7.
>
>
>
>
>
>
>
>
>
> On Thu, Dec 12, 2013 at 12:54 PM, Peter Keegan  >wrote:
>
> > This is pretty cool, and worthy of adding to Solr in Action (v2) and the
> > other books. With function queries, flexible filter processing and
> caching,
> > custom collectors, and post filters, there's a lot of flexibility here.
> >
> > Btw, the query times using a custom collector to scale/recompute scores
> is
> > excellent (will have to see how it compares to your outlined solution).
> >
> > Thanks,
> > Peter
> >
> >
> > On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein 
> > wrote:
> >
> > > The sorting is going to happen in the lower-level collectors. You need
> > > a value source that returns the score of the document being collected.
> > >
> > > Here is how you can make this happen:
> > >
> > > 1) Create an object in your PostFilter that simply holds the current
> > > score. Place this object in the SearchRequest context map. Update
> > > object.score as you pass the docs and scores to the lower collectors.
> > >
> > > 2) Create a value source that checks the SearchRequest context for the
> > > object that's holding the current score. Use this object to return the
> > > current score when called. For example, if you give the value source a
> > > handle called "score", a compound function call will look like this:
> > > sum(score(), field(x))
> > >
> > > Joel
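
A minimal sketch of steps 1) and 2) above; the class names are illustrative
(not an existing Solr API), and it assumes the PostFilter puts the holder
into the SolrQueryRequest context map and a custom ValueSourceParser fetches
it from there and passes it into the value source:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;

// Shared between the PostFilter (writer) and the value source (reader).
class ScoreHolder {
  volatile float score;
}

public class CurrentScoreValueSource extends ValueSource {

  private final ScoreHolder holder;

  public CurrentScoreValueSource(ScoreHolder holder) {
    this.holder = holder;
  }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
      throws IOException {
    return new FloatDocValues(this) {
      @Override
      public float floatVal(int doc) {
        // The PostFilter updates holder.score just before delegating each
        // doc, so this is the score of the doc currently being collected.
        return holder.score;
      }
    };
  }

  @Override
  public boolean equals(Object o) {
    return o == this; // one instance per request, so identity is fine
  }

  @Override
  public int hashCode() {
    return System.identityHashCode(this);
  }

  @Override
  public String description() {
    return "score()";
  }
}

With the parser registered under the handle "score", a compound function such
as sum(score(), field(x)) then reads the live score exactly as described above.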
> > >
> > > On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan wrote:
> > >
> > > > Regarding my original goal, which is to perform a math function using
> > > > the scaled score and a field value, and sort on the result, how does
> > > > this fit in? Must I implement another custom PostFilter with a higher
> > > > cost than the scale PostFilter?
> > > >
> > > > Thanks,
> > > > Peter
> > > >
> > > >
> > > > On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> > > >
> > > > > Thanks very much for the guidance. I'd be happy to donate a working
> > > > > solution.
> > > > >
> > > > > Peter
> > > > >
> > > > >
> > > > > On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein <joels...@gmail.com> wrote:
> > > > >
> > > > >> SOLR-5020 has the commit info; it's mainly changes to
> > > > >> SolrIndexSearcher, I believe. They might apply to 4.3.
> > > > >> I think as long as you have the finish method, that's all you'll
> > > > >> need. If you can get this working it would be excellent if you
> > > > >> could donate back the Scale PostFilter.
> > > > >>
> > > > >>
> > > > >> On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> > > > >>
> > > > >> > This is what I was looking for, but the DelegatingCollector
> > > > >> > 'finish' method doesn't exist in 4.3.0 :(   Can this be patched
> > > > >> > in, and are there any other PostFilter dependencies on 4.5?
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Peter
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein <joels...@gmail.com> wrote:
> > > > >> >
> > > > >> > > Here is one approach to use in a PostFilter:
> > > > >> > >
> > > > >> > > 1) In the collect() method call score for each doc. Use the
> > > > >> > > scores to create your scaleInfo.
> > > > >> > > 2) Keep a bitset of the hits and a priorityQueue of your top X
> > > > >> > > ScoreDocs.
> > > > >> > > 3) Don't delegate any documents to lower collectors in the
> > > > >> > > collect() method.
> > > > >> > > 4) In the finish method create a score mapping (use the hppc
> > > > >> > > IntFloatOpenHashMap) with your top X docIds pointing to their
> > > > >> > > score, using the priorityQueue created in step 2. Then iterate
> > > > >> > > the bitset
> 
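
The last message above is cut off mid-step, so the finish() body below, past
the map-building, is a guess at the intended replay rather than Joel's actual
code; class names and the replay mechanics are illustrative. A sketch along
those lines, against the Solr 4.x collector API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.carrotsearch.hppc.IntFloatOpenHashMap;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.PriorityQueue;
import org.apache.solr.search.DelegatingCollector;

public class ScaleCollector extends DelegatingCollector {

  private final FixedBitSet hits;        // step 2: every hit
  private final ScoreDocQueue topX;      // step 2: top X by score
  private final List<AtomicReaderContext> leaves = new ArrayList<AtomicReaderContext>();
  private Scorer scorer;

  public ScaleCollector(int maxDoc, int x) {
    hits = new FixedBitSet(maxDoc);
    topX = new ScoreDocQueue(x);
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    this.scorer = scorer;                // keep our own handle, don't pass down yet
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    this.docBase = context.docBase;      // remember the segments for the replay
    leaves.add(context);
  }

  @Override
  public void collect(int doc) throws IOException {
    int globalDoc = docBase + doc;
    hits.set(globalDoc);
    topX.insertWithOverflow(new ScoreDoc(globalDoc, scorer.score())); // step 1
    // step 3: intentionally no super.collect() here
  }

  @Override
  public void finish() throws IOException {
    // step 4: map the top X global docids to their (scaled) scores
    IntFloatOpenHashMap scores = new IntFloatOpenHashMap(topX.size());
    for (ScoreDoc sd = topX.pop(); sd != null; sd = topX.pop()) {
      scores.put(sd.doc, sd.score);      // apply the scaleInfo here as needed
    }
    // then replay the bitset into the lower collectors, segment by segment
    if (!leaves.isEmpty()) {
      ReplayScorer replay = new ReplayScorer();
      delegate.setScorer(replay);
      int leaf = 0;
      delegate.setNextReader(leaves.get(leaf));
      for (int doc = hits.nextSetBit(0); doc >= 0;
           doc = doc + 1 < hits.length() ? hits.nextSetBit(doc + 1) : -1) {
        while (leaf + 1 < leaves.size() && doc >= leaves.get(leaf + 1).docBase) {
          delegate.setNextReader(leaves.get(++leaf));
        }
        replay.score = scores.get(doc);  // 0.0f for docs outside the top X
        delegate.collect(doc - leaves.get(leaf).docBase);
      }
    }
    super.finish();                      // cascade finish() down the chain
  }

  private static final class ScoreDocQueue extends PriorityQueue<ScoreDoc> {
    ScoreDocQueue(int size) { super(size); }
    @Override
    protected boolean lessThan(ScoreDoc a, ScoreDoc b) { return a.score < b.score; }
  }

  // Minimal scorer that lets the lower collectors read back the mapped score.
  private static final class ReplayScorer extends Scorer {
    float score;
    ReplayScorer() { super(null); }
    @Override public float score() { return score; }
    @Override public int freq() { return 1; }
    @Override public int docID() { return -1; }
    @Override public int nextDoc() { throw new UnsupportedOperationException(); }
    @Override public int advance(int target) { throw new UnsupportedOperationException(); }
    @Override public long cost() { return 0; }
  }
}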

Re: PostingsSolrHighlighter

2013-12-19 Thread jd
> Hi Josip

Hi Liu,

> that's quite weird; in my experience highlighting is strict on string
> fields, which need an exact match, while text fields should be fine.
>
> I copied your schema definition and did a quick test in a new core;
> everything is default from the tutorial, and the search component is
> using solr.HighlightComponent.

I think there is a misunderstanding: the "normal" solr.HighlightComponent
is working just fine, but with the PostingsSolrHighlighter I get no
results.

> search on searchable_text can highlight text; I copied your search url and
> just changed the host part, so the input parameters are exactly the same.
>
> The result is attached.
>
> Can you upload your complete solrconfig.xml and schema.xml?

Cya Josip
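
For what it's worth: as I understand the PostingsSolrHighlighter (worth
verifying against the 4.x javadocs), it reads offsets from the postings, so
fields highlighted with it must be indexed with
storeOffsetsWithPositions="true" and then reindexed; fields indexed without
offsets produce no usable highlights. A sketch of the two pieces, reusing the
searchable_text field from this thread:

<!-- solrconfig.xml: swap the highlighting implementation -->
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>

<!-- schema.xml: offsets must be in the postings (requires a reindex) -->
<field name="searchable_text" type="text_general" indexed="true"
       stored="true" storeOffsetsWithPositions="true"/>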



Re: problem with facets - out of memory exception

2013-12-19 Thread Marc Sturlese
Have you tried reindexing using DocValues? Fields used for faceting are then
stored on disk and not in RAM via the FieldCache. If you have enough memory
they will be loaded into the system cache but not onto the Java heap. This
is also good for GC when committing.
http://wiki.apache.org/solr/DocValues
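
A sketch of what that looks like in schema.xml (the field name is
illustrative; enabling docValues requires a full reindex):

<!-- facet field backed by DocValues instead of the FieldCache -->
<field name="category" type="string" indexed="true" stored="false"
       docValues="true"/>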





Re: DataImport Handler, writing a new EntityProcessor

2013-12-19 Thread Mathias Lux
Hi!

Thanks for all the advice! I finally did it. The most annoying error, which
took me the best part of a day to figure out, was that the state variable
here had to be reset:
https://bitbucket.org/dermotte/liresolr/src/d27878a71c63842cb72b84162b599d99c4408965/src/main/java/net/semanticmetadata/lire/solr/LireEntityProcessor.java?at=master#cl-56

The EntityProcessor is part of this image search plugin if anyone is
interested: https://bitbucket.org/dermotte/liresolr/

:) It's always the small things that are hard to find

cheers and thanks, Mathias
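
For anyone who hits the same thing: the processor instance is reused for
every row the parent entity (here the FileListEntityProcessor) emits, so
per-document state has to be reset in init(), not just at construction time.
A minimal sketch of the pattern (class and field names are illustrative, not
the actual LireEntityProcessor code):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class ImageEntityProcessor extends EntityProcessorBase {

  private boolean done; // per-document state: emitted our single row yet?

  @Override
  public void init(Context context) {
    super.init(context);
    done = false;       // the fix: reset for every file the parent hands over
  }

  @Override
  public Map<String, Object> nextRow() {
    if (done) {
      return null;      // null tells DIH this entity has no more rows
    }
    done = true;
    String filePath = context.getResolvedEntityAttribute("filePath");
    Map<String, Object> row = new HashMap<String, Object>();
    row.put("id", filePath); // extract and add the real image fields here
    return row;
  }
}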

On Wed, Dec 18, 2013 at 7:26 PM, P Williams wrote:
> Hi Mathias,
>
> I'd recommend testing one thing at a time. See if you can get it to work
> for one image before you try a directory of images. Also try testing using
> the solr-testframework in your IDE (I use Eclipse) to debug, rather than
> your browser/print statements. Hopefully that will give you some more
> specific knowledge of what's happening around your plugin.
>
> I also wrote an EntityProcessor plugin to read from a properties file.
> Hopefully that'll give you some insight about this kind of Solr plugin and
> how to test them.
>
> Cheers,
> Tricia
>
>
>
>
> On Wed, Dec 18, 2013 at 3:03 AM, Mathias Lux wrote:
>
>> Hi all!
>>
>> I've got a question regarding writing a new EntityProcessor, in the
>> same sense as the Tika one. My EntityProcessor should analyze jpg
>> images and create document fields to be used with the LIRE Solr plugin
>> (https://bitbucket.org/dermotte/liresolr). Basically I've taken the
>> same approach as the TikaEntityProcessor, but my setup just indexes
>> the first of 1000 images. I'm using a FileListEntityProcessor to get
>> all JPEGs from a directory and then I'm handing them over (see [2]).
>> My code for the EntityProcessor is at [1]. I've tried to use the
>> DataSource as well as the filePath attribute, but it ends up all the
>> same. However, the FileListEntityProcessor is able to read all the
>> files according to the debug output, but I'm missing the link from the
>> FileListEntityProcessor to the LireEntityProcessor.
>>
>> I'd appreciate any pointer or help :)
>>
>> cheers,
>>   Mathias
>>
>> [1] LireEntityProcessor http://pastebin.com/JFajkNtf
>> [2] dataConfig http://pastebin.com/vSHucatJ
>>
>> --
>> Dr. Mathias Lux
>> Klagenfurt University, Austria
>> http://tinyurl.com/mlux-itec
>>



-- 
PD Dr. Mathias Lux
Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Reading Config files Solr

2013-12-19 Thread Mukundaraman valakumaresan
Hi

Is it possible to read configuration properties inside Solr?

For example, I have a property file
F:\solr\example\solr\collection1\conf\test.properties which contains lots of
key/value entries.

Is there a way to read this file using a relative path and use it inside a
custom function?

Thanks & Regards
Mukund
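
One way to do this, sketched for Solr 4.x (the parser and function names are
made up): load the file through the core's SolrResourceLoader, which resolves
relative names against the collection's conf/ directory, and expose the
values through a custom ValueSourceParser.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.valuesource.DoubleConstValueSource;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

public class PropsValueSourceParser extends ValueSourceParser {

  private volatile Properties props; // loaded once, reused across requests

  @Override
  public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    if (props == null) {
      Properties p = new Properties();
      try {
        // "test.properties" is resolved relative to collection1/conf/
        InputStream in = fp.getReq().getCore().getResourceLoader()
            .openResource("test.properties");
        try {
          p.load(in);
        } finally {
          in.close();
        }
      } catch (IOException e) {
        throw new SyntaxError("Could not load test.properties: " + e);
      }
      props = p;
    }
    String key = fp.parseArg(); // e.g. props('boost.factor')
    return new DoubleConstValueSource(Double.parseDouble(props.getProperty(key, "0")));
  }
}

Registered in solrconfig.xml with
<valueSourceParser name="props" class="com.example.PropsValueSourceParser"/>,
the keys then become usable inside function queries.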


Solr cloud (4.6.0) instances going down

2013-12-19 Thread ilay raja
Hi,

  I have deployed Solr Cloud with an external ZooKeeper ensemble (5
instances). I am running Solr instances on two servers with a single-shard
index. There are 6 replicas. I often see Solr going down during high search
load or whenever I am indexing documents. I tried tuning the hard commit
(kept at 15 mins) and soft commit (12 mins) intervals. I also set
zkClientTimeout to 30 secs. I sometimes observe OOM, socket exceptions, and
EOF exceptions in the Solr logs while the instance is going down. Also,
ZooKeeper recovery for the Solr instance goes into a loop. My use case is
high search load (100 queries per sec) with heavy indexing (10K docs per
minute). What is the best way to keep Solr Cloud instances stable with an
external ensemble? Should we try running ZooKeeper internally, since
ZooKeeper handshaking might be an issue as well? Is Solr Cloud stable for
production, or are there still open issues? Please guide me.
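
For reference, a sketch of the commit intervals described above as they would
appear in solrconfig.xml; openSearcher=false is the usual advice under heavy
indexing, so that hard commits only flush and truncate the transaction log
without opening a new searcher:

<!-- hard commit every 15 min; flushes and truncates the tlog -->
<autoCommit>
  <maxTime>900000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- soft commit every 12 min; controls when new docs become visible -->
<autoSoftCommit>
  <maxTime>720000</maxTime>
</autoSoftCommit>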


Re: solr OOM Crash

2013-12-19 Thread Sébastien Michel
Hi Sandra,

I'm not sure if your problem is the same as ours, but we encountered the
same issue on Solr 4.2: the major memory usage was due to
CompressingStoredFieldsReader and GC went crazy.
In our context, we have some stored fields, and for some documents the
content of the text field can be huge.

We resolved our issue with the backport of this fix:
https://issues.apache.org/jira/browse/LUCENE-4995

You should also upgrade to Solr 4.4 or later.

Regards,
Sébastien


2013/12/12 Sandra Scott 

> Hello,
>
> We are experiencing unexplained OOM crashes. We have already seen them a
> few times, across our different Solr instances. The crash happens only at a
> single shard of the collection.
>
> Environment details:
> 1. Solr 4.3, running on tomcat.
> 2. 24 Shards.
> 3. Indexing rate of ~800 docs per minute.
>
> Solrconfig.xml:
> 1. Merge factor 4
> 2. Soft commit every 10 min
> 3. Hard commit every 30 min
>
> Main findings:
> 1. Solr logs: No query failures prior to the OOM, but DOUBLE the number of
> soft and hard commits in comparison to other shards.
> 2. Analyzing the dump (VisualVM): Class byte[] takes 4GB out of the 5GB
> given to the JVM, mainly referenced by CompressingStoredFieldsReader GC
> roots (which, by looking at the code, we suspect were created by
> CompressingStoredFieldsWriter.merge).
>
> Sub findings:
> 1. GC logs: Showed 108 GC failures prior to the crash.
> 2. CPU: Overall usage seems fine, but the % of CPU time spent in GC stays
> high for 6 min before the OOM.
> 3. Memory: Half an hour before the OOM, usage slowly rises until it gets
> to 5.4GB.
>
> Has anyone encountered a higher-than-normal commit rate that seems to
> increase the merge rate and cause what I described?
>


Solr cloud Performance Problems

2013-12-19 Thread hariprasadh89
We have done the Solr Cloud setup:
In one machine:
1. Centos 6.3
2. Apache Solr 4.1
3. JBoss AS 7.1.1.Final
4. ZooKeeper
Let's set up the ZooKeeper cloud on 2 machines.

Download and untar ZooKeeper into the /opt/zookeeper directory on both
servers, solr1 & solr2. On both servers do the following:

root@solr1$ mkdir /opt/zookeeper/data
root@solr1$ cp /opt/zookeeper/conf/zoo_sample.cfg
/opt/zookeeper/conf/zoo.cfg
root@solr1$ vim /opt/zookeeper/conf/zoo.cfg
Make the following changes in the zoo.cfg file

dataDir=/opt/zookeeper/data
server.1=solr1:2888:3888
server.2=solr2:2888:3888
Save the zoo.cfg file.
Assign different ids to the ZooKeeper servers:

on solr1

root@solr1$ echo 1 > /opt/zookeeper/data/myid

on solr2

root@solr2$ echo 2 > /opt/zookeeper/data/myid

Start ZooKeeper on both servers:

root@solr1$ cd /opt/zookeeper
root@solr1$ ./bin/zkServer.sh start
Note: in the future, when you need to reset the cluster/shards information,
do the following.

5. RAM: 2GB
6. Set the heap size to 1GB
Extract the solr.war and change the Solr home in Solr's web.xml.
In the bin folder of JBoss, the JAVA_OPTS parameter has been set in
standalone.conf:
java -DzkHost=solr1:2181,solr2:2181 -Dbootstrap_confdir=solr/corename/conf/
-DnumShards=2

Restart JBoss.

On another machine:

1. Centos 6.3
2. Apache Solr 4.1
3. JBoss AS 7.1.1.Final
4. RAM: 2GB
5. Set the heap size to 1GB
Extract the solr.war and change the Solr home in Solr's web.xml.
In the bin folder of JBoss, the JAVA_OPTS parameter has been set in
standalone.conf.

Restart JBoss.

Everything has been done properly, but it is taking too much time to upload
data into Solr: more time than uploading data to a single Solr instance
without sharding.
We are able to view the two shards in the Solr Cloud view of the UI.
Please explain how the larger index is split and allocated across the two
shards.

Please suggest some optimization techniques.







Re: Solr hanging when extracting a some broken .doc files

2013-12-19 Thread Raymond Wiker
On Thu, Dec 19, 2013 at 10:01 AM, Charlie Hull  wrote:

> On 18/12/2013 09:03, Alexandre Rafalovitch wrote:
>
>> Charlie,
>>
>> Does it mean you are talking to it from a client program? Or are you
>> running Tika in a listen/server mode and build some adapters for standard
>> Solr processes?
>>
>
> If we're writing indexers in Python we usually run Tika as a server -
> which means we can try to restart it if it fails to respond, usually
> because it's eaten something that disagreed with it! We'd then submit the
> extracted text to Solr.
>
>
>
We're also running Tika as a server, using tika-app.*.jar. There is also a
tika-server.*.jar, which gives an HTTP interface (instead of the raw TCP
interface offered by tika-app), but we opted to use tika-app.

We have not seen any need to restart the tika server process, although
there are cases where it takes so long to provide a reply that we abandon
the request; tika-app seems to handle that well (i.e., it does not seem to
get stuck afterwards).

There are some semi-tricky details to using tika in server mode (involving
blocking & deadlocks, and the possibility that tika loops on certain
documents), but we have been able to feed ~1M documents through a single
tika server process without restarting it.

Note that, in some cases, the xhtml output from tika is incorrect, so we've
had to switch to html output and a more forgiving parser.
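
A minimal sketch of the "abandon slow requests" pattern described above,
using the plain Tika API in-process rather than the poster's socket-based
server (whose protocol isn't shown here): the extraction runs in a worker
thread and the caller gives up after a timeout instead of blocking forever
on a pathological document.

import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.tika.Tika;

public class TimedExtractor {

  private final Tika tika = new Tika();
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  // Returns extracted text, or null if the document is broken or too slow.
  public String extract(final InputStream in, long timeoutSeconds) {
    Future<String> future = pool.submit(new Callable<String>() {
      @Override
      public String call() throws Exception {
        return tika.parseToString(in); // may loop or hang on broken documents
      }
    });
    try {
      return future.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // abandon the request; the worker may linger a while
      return null;
    } catch (Exception e) {
      return null;         // broken file: log and skip instead of failing the batch
    }
  }
}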


problem with facets - out of memory exception

2013-12-19 Thread adfel70
Hi
I have a cluster of 14 nodes (7 shards, 2 replicas), each node with a 6GB
JVM heap, on Solr 4.3.0.
I have 400 million docs in the cluster; each node holds around 60GB of index.
I index new docs each night, around a million a night.

As the index started to grow, I started having OutOfMemory problems when
querying with facets.
The exception occurs in one of the nodes when querying a specific facet
field. When I restart this node and query again, it doesn't happen, until I
perform some more indexing; then it might happen again with another facet
field.

The fields that cause the failure have less than 20 unique values.

Any idea why this happens?
Why does restarting the node (without adding more memory) solve the problem
temporarily?
What does Solr do behind the scenes when asking for facets?

Thanks.








Re: Solr hanging when extracting a some broken .doc files

2013-12-19 Thread Charlie Hull

On 18/12/2013 09:03, Alexandre Rafalovitch wrote:

Charlie,

Does it mean you are talking to it from a client program? Or are you
running Tika in a listen/server mode and build some adapters for standard
Solr processes?


If we're writing indexers in Python we usually run Tika as a server - 
which means we can try to restart it if it fails to respond, usually 
because it's eaten something that disagreed with it! We'd then submit 
the extracted text to Solr.


Regards

Charlie


Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Dec 18, 2013 at 3:47 PM, Charlie Hull  wrote:


On 17/12/2013 15:29, Augusto Camarotti wrote:


Hi guys,
 I'm having a problem with Solr when trying to index some broken .doc
files.
 I have set up a test case using Solr to index all the files the users
save on the shared directories of the company that I work for, and Solr
hangs when trying to index this file in particular (the one I'm attaching
to this e-mail). There are some other broken .doc files that Solr indexes
by name without a problem, even logging some Tika errors during the
process, but when it reaches this file in particular, it hangs and I have
to cancel the upload.
 I cannot guarantee the directories will never hold a broken .doc file,
or a broken file with some other extension, so I guess Solr could just
return a failure message, or something like that.
 These are the logging messages Solr is recording:
03:38:23 ERROR SolrCore org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@386f9474
03:38:25 ERROR SolrDispatchFilter null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@386f9474

So, how do I prevent Solr from hanging when trying to index broken files?
Regards,
Augusto Camarotti



We don't like to run Tika from within Solr ourselves, as it has been known
to barf (especially on large PDF files; yes, there are such horrors as
3000-page PDFs!). We usually run it in an external process so it can be
watched and killed if necessary.

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-19 Thread Anca Kopetz

Hi,

Thank you for the link, it does not seem to be the same problem ...

Best regards,
Anca

On 12/18/2013 11:41 PM, Furkan KAMACI wrote:

Hi Anca;

Could you check the conversation at here:
http://lucene.472066.n3.nabble.com/ColrCloud-IOException-occured-when-talking-to-server-at-td4061831.html

Thanks;
Furkan KAMACI


On Wednesday, 18 December 2013, Anca Kopetz wrote:

Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time
to time into recovering state, and it takes more than 10 minutes for them
to finish recovering.

In the logs, we see that "PeerSync Recovery" fails with the message:

PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates

Then "Replication Recovery" starts.

Is there something we can do to avoid the failure of "PeerSync Recovery" so
that the recovery process is more rapid (less than 10 minutes)?

The full trace log is here:

2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover

2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover

2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 QTime=0

2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering

2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 QTime=1

2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering

2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property

2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property

2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green

2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false

2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green

2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false

2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering

2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property

2013-12-05 13:51:53,767 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false

2013-12-05 13:51:54,777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18)

2013-12-05 13:51:56,804 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - recoveringAfterStartup=false

2013-12-05 13:51:56,806 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates

2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering updates. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green

2013-12