Re: error

2016-02-11 Thread Midas A
solr 5.2.1

On Fri, Feb 12, 2016 at 12:59 PM, Shawn Heisey  wrote:

> On 2/11/2016 10:13 PM, Midas A wrote:
> > We upgraded our Solr version last night and are now getting the following error:
> >
> > org.apache.solr.common.SolrException: Bad content Type for search handler
> > :application/octet-stream
> >
> > What should I do to remove this?
>
> What version did you upgrade from and what version did you upgrade to?
> How was the new version installed, and how are you starting it?  What
> kind of software are you using for your clients?
>
> We also need to see all error messages in the solr logfile, including
> stacktraces.  Having access to the entire logfile would be very helpful,
> but before sharing that, you might want to check it for sensitive
> information and redact it.
>
> Thanks,
> Shawn
>
>


Re: error

2016-02-11 Thread Shawn Heisey
On 2/11/2016 10:13 PM, Midas A wrote:
> We upgraded our Solr version last night and are now getting the following error:
>
> org.apache.solr.common.SolrException: Bad content Type for search handler
> :application/octet-stream
>
> What should I do to remove this?

What version did you upgrade from and what version did you upgrade to? 
How was the new version installed, and how are you starting it?  What
kind of software are you using for your clients?

We also need to see all error messages in the solr logfile, including
stacktraces.  Having access to the entire logfile would be very helpful,
but before sharing that, you might want to check it for sensitive
information and redact it.

Thanks,
Shawn
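
A quick way to pull out the full error and stacktrace that Shawn is asking for
(a sketch; the path assumes a default Solr 5.x install that writes to
server/logs/solr.log):

    # print the exception and the 40 lines that follow it (usually the whole stacktrace)
    grep -n -A 40 "Bad content Type for search handler" server/logs/solr.log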



Re: error

2016-02-11 Thread Midas A
My log keeps growing. It is urgent.

On Fri, Feb 12, 2016 at 10:43 AM, Midas A  wrote:

> We upgraded our Solr version last night and are now getting the following error:
>
> org.apache.solr.common.SolrException: Bad content Type for search handler
> :application/octet-stream
>
> What should I do to remove this?
>


Re: Need to move on SOlr cloud (help required)

2016-02-11 Thread Midas A
Erick ,

bq: We want the hits on solr servers to be distributed

True, this happens automatically in SolrCloud, but a simple load
balancer in front of master/slave does the same thing.

Midas: So in a SolrCloud architecture we would not need a load balancer?

On Thu, Feb 11, 2016 at 11:42 PM, Erick Erickson 
wrote:

> bq: We want the hits on solr servers to be distributed
>
> True, this happens automatically in SolrCloud, but a simple load
> balancer in front of master/slave does the same thing.
>
> bq: what if master node fail what should be our fail over strategy  ?
>
> This is, indeed one of the advantages for SolrCloud, you don't have
> to worry about this any more.
>
> Another benefit (and you haven't touched on whether this matters)
> is that in SolrCloud you do not have the latency of polling and
> replicating from master to slave, in other words it supports Near Real
> Time.
>
> This comes at some additional complexity however. If you have
> your master node failing often enough to be a problem, you have
> other issues ;)...
>
> And the recovery strategy if the master fails is straightforward:
> 1> pick one of the slaves to be the master.
> 2> update the other nodes to point to the new master
> 3> re-index the docs from before the old master failed to the new master.
>
> You can use system variables to not even have to manually edit all of the
> solrconfig files, just supply different -D parameters on startup.
>
> Best,
> Erick
>
> On Wed, Feb 10, 2016 at 10:39 PM, kshitij tyagi
>  wrote:
> > @Jack
> >
> > Currently we have around 55,00,000 docs
> >
> > Its not about load on one node we have load on different nodes at
> different
> > times as our traffic is huge around 60k users at a given point of time
> >
> > We want the hits on solr servers to be distributed so we are planning to
> > move on solr cloud as it would be fault tolerant.
> >
> >
> >
> > On Thu, Feb 11, 2016 at 11:10 AM, Midas A  wrote:
> >
> >> hi,
> >> what if master node fail what should be our fail over strategy  ?
> >>
> >> On Wed, Feb 10, 2016 at 9:12 PM, Jack Krupansky <
> jack.krupan...@gmail.com>
> >> wrote:
> >>
> >> > What exactly is your motivation? I mean, the primary benefit of
> SolrCloud
> >> > is better support for sharding, and you have only a single shard. If
> you
> >> > have no need for sharding and your master-slave replicated Solr has
> been
> >> > working fine, then stick with it. If only one machine is having a load
> >> > problem, then that one node should be replaced. There are indeed
> plenty
> >> of
> >> > good reasons to prefer SolrCloud over traditional master-slave
> >> replication,
> >> > but so far you haven't touched on any of them.
> >> >
> >> > How much data (number of documents) do you have?
> >> >
> >> > What is your typical query latency?
> >> >
> >> >
> >> > -- Jack Krupansky
> >> >
> >> > On Wed, Feb 10, 2016 at 2:15 AM, kshitij tyagi <
> >> > kshitij.shopcl...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > We are currently using solr 5.2 and I need to move on solr cloud
> >> > > architecture.
> >> > >
> >> > > As of now we are using 5 machines :
> >> > >
> >> > > 1. I am using 1 master where we are indexing ourdata.
> >> > > 2. I replicate my data on other machines
> >> > >
> >> > > One or the other machine keeps on showing high load so I am
> planning to
> >> > > move on solr cloud.
> >> > >
> >> > > Need help on following :
> >> > >
> >> > > 1. What should be my architecture in case of 5 machines to keep
> >> > (zookeeper,
> >> > > shards, core).
> >> > >
> >> > > 2. How to add a node.
> >> > >
> >> > > 3. what are the exact steps/process I need to follow in order to
> change
> >> > to
> >> > > solr cloud.
> >> > >
> >> > > 4. How indexing will work in solr cloud as of now I am using mysql
> >> query
> >> > to
> >> > > get the data on master and then index the same (how I need to change
> >> this
> >> > > in case of solr cloud).
> >> > >
> >> > > Regards,
> >> > > Kshitij
> >> > >
> >> >
> >>
>
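
A minimal sketch of the system-variable approach Erick describes above, so the
slaves never hard-code the master URL (the property name replication.master.url
is made up for the example; any name works):

    # solrconfig.xml on every node references a property instead of a literal URL:
    #   <str name="masterUrl">${replication.master.url:}</str>
    #
    # each slave is then started with its own value, so promoting a new master is
    # a restart with a different -D flag rather than an edit of solrconfig.xml:
    bin/solr start -p 8983 -Dreplication.master.url=http://newmaster:8983/solr/mycore/replication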


error

2016-02-11 Thread Midas A
We upgraded our Solr version last night and are now getting the following error:

org.apache.solr.common.SolrException: Bad content Type for search handler
:application/octet-stream

What should I do to remove this?


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Again, first things first... debugQuery=true and see which Solr search
components are consuming the bulk of qtime.

-- Jack Krupansky
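
For reference, the per-component timings Jack refers to can be pulled with a
plain curl call; the collection name here is a placeholder:

    curl "http://localhost:8983/solr/collection1/select" \
         --data-urlencode "q=*:*" \
         --data-urlencode "rows=1000" \
         --data-urlencode "wt=json" \
         --data-urlencode "debugQuery=true"
    # in the JSON response, debug -> timing lists prepare/process time per search component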

On Thu, Feb 11, 2016 at 11:33 AM, Matteo Grolla 
wrote:

> virtual hardware, 200ms is taken on the client until response is written to
> disk
> qtime on solr is ~90ms
> not great but acceptable
>
> Is it possible that the method FilenameUtils.splitOnTokens is really so
> heavy when requesting a lot of rows on slow hardware?
>
> 2016-02-11 17:17 GMT+01:00 Jack Krupansky :
>
> > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but
> still
> > relatively bad. Even 50ms for 10 rows would be considered barely okay.
> > But... again it depends on query complexity - simple queries should be
> well
> > under 50 ms for decent modern hardware.
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla  >
> > wrote:
> >
> > > Hi Jack,
> > >   Response times scale with rows. The relationship doesn't seem linear,
> > > but below 400 rows the times are much faster.
> > > I view query times from solr logs and they are fast
> > > the same query with rows = 1000 takes 8s
> > > with rows = 10 takes 0.2s
> > >
> > >
> > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky :
> > >
> > > > Are queries scaling linearly - does a query for 100 rows take 1/10th
> > the
> > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> > > >
> > > > Does the app need/expect exactly 1,000 documents for the query or is
> > that
> > > > just what this particular query happened to return?
> > > >
> > > > What does they query look like? Is it complex or use wildcards or
> > > function
> > > > queries, or is it very simple keywords? How many operators?
> > > >
> > > > Have you used the debugQuery=true parameter to see which search
> > > components
> > > > are taking the time?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <
> > matteo.gro...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Yonik,
> > > > >  after the first query I find 1000 docs in the document cache.
> > > > > I'm using curl to send the request and requesting javabin format to
> > > mimic
> > > > > the application.
> > > > > gc activity is low
> > > > > I managed to load the entire 50GB index in the filesystem cache,
> > after
> > > > that
> > > > > queries don't cause disk activity anymore.
> > > > > Times improve: queries that took ~30s now take <10s, but I hoped for
> > > > > better.
> > > > > I'm going to use jvisualvm's sampler to analyze where time is spent
> > > > >
> > > > >
> > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
> > > > >
> > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> > > > matteo.gro...@gmail.com>
> > > > > > wrote:
> > > > > > > Thanks Toke, yes, they are long times, and solr qtime (to
> execute
> > > the
> > > > > > > query) is a fraction of a second.
> > > > > > > The response in javabin format is around 300k.
> > > > > >
> > > > > > OK, That tells us a lot.
> > > > > > And if you actually tested so that all the docs would be in the
> > cache
> > > > > > (can you verify this by looking at the cache stats after you
> > > > > > re-execute?) then it seems like the slowness is down to any of:
> > > > > > a) serializing the response (it doesn't seem like a 300K response
> > > > > > should take *that* long to serialize)
> > > > > > b) reading/processing the response (how fast the client can do
> > > > > > something with each doc is also a factor...)
> > > > > > c) other (GC, network, etc)
> > > > > >
> > > > > > You can try taking client processing out of the equation by
> trying
> > a
> > > > > > curl request.
> > > > > >
> > > > > > -Yonik
> > > > > >
> > > > >
> > > >
> > >
> >
>
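
A sketch of the curl test Yonik suggests, which takes client processing out of
the picture and isolates Solr time plus serialization; the collection name and
query are placeholders, -o /dev/null discards the body, and -w prints total
wall-clock time:

    curl -s -o /dev/null -w "rows=10:   %{time_total}s\n" \
         "http://localhost:8983/solr/collection1/select?q=*:*&rows=10&wt=javabin"
    curl -s -o /dev/null -w "rows=1000: %{time_total}s\n" \
         "http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&wt=javabin"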


RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
Y, and you can't actually kill a thread.  You can ask nicely via 
Thread.interrupt(), but some of our dependencies don't bother to listen  for 
that.  So, you're pretty much left with a separate process as the only robust 
solution.

So, we did the parent-child process thing for directory-> directory processing 
in tika-app via tika-batch.

The next step is to harden tika-server and to kick that off in a child process 
in a similar way.

For those who want to test their Tika harnesses (whether on single box, 
Hadoop/Spark etc), we added a MockParser that will do whatever you tell it when 
it hits an "application/xml+mock" file...full set of options:

<mock>
  <!-- element names assumed from tika-core's MockParser test documents -->
  <metadata name="author">Nikolai Lobachevsky</metadata>
  <write element="p">some content</write>
  <print_out>writing to System.out</print_out>
  <print_err>writing to System.err</print_err>
  <hang millis="30000" heavy="false" />
  <oom/>
  <throw class="java.io.IOException">not another IOException</throw>
  <system_exit/>
</mock>

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, February 11, 2016 7:46 PM
To: solr-user 
Subject: Re: How is Tika used with Solr

Well, I'd imagine you could spawn threads and monitor/kill them as necessary, 
although that doesn't deal with OOM errors

FWIW,
Erick

On Thu, Feb 11, 2016 at 3:08 PM, xavi jmlucjav  wrote:
> For sure, if I need heavy duty text extraction again, Tika would be 
> the obvious choice if it covers dealing with hangs. I never used 
> tika-server myself (not sure if it existed at the time) just used tika from 
> my own jvm.
>
> On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. 
> 
> wrote:
>
>> x-post to Tika user's
>>
>> Y and n.  If you run tika app as:
>>
>> java -jar tika-app.jar <input_directory> <output_directory>
>>
>> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  
>> This creates a parent and child process, if the child process notices 
>> a hung thread, it dies, and the parent restarts it.  Or if your OS 
>> gets upset with the child process and kills it out of self 
>> preservation, the parent restarts the child, or if there's an 
>> OOM...and you can configure how often the child shuts itself down 
>> (with parental restarting) to mitigate memory leaks.
>>
>> So, y, if your use case allows <input directory -> output directory> processing, then we now
>> have that in Tika.
>>
>> I've been wanting to add a similar watchdog to tika-server ... any 
>> interest in that?
>>
>>
>> -Original Message-
>> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
>> Sent: Thursday, February 11, 2016 2:16 PM
>> To: solr-user 
>> Subject: Re: How is Tika used with Solr
>>
>> I have found that when you deal with large amounts of all sort of 
>> files, in the end you find stuff (pdfs are typically nasty) that will hang 
>> tika.
>> That is even worse than a crash or OOM.
>> We used aperture instead of tika because at the time it provided a 
>> watchdog feature to kill what seemed like a hanged extracting thread. 
>> That feature is super important for a robust text extracting 
>> pipeline. Has Tika gained such feature already?
>>
>> xavier
>>
>> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
>> 
>> wrote:
>>
>> > Timothy's points are absolutely spot-on. In production scenarios, 
>> > if you use the simple "run Tika in a SolrJ program" approach you 
>> > _must_ abort the program on OOM errors and the like and  figure out 
>> > what's going on with the offending document(s). Or record the name 
>> > somewhere and skip it next time 'round. Or
>> >
>> > How much you have to build in here really depends on your use case.
>> > For "small enough"
>> > sets of documents or one-time indexing, you can get by with dealing 
>> > with errors one at a time.
>> > For robust systems where you have to have indexing available at all 
>> > times and _especially_ where you don't control the document corpus, 
>> > you have to build something far more tolerant as per Tim's comments.
>> >
>> > FWIW,
>> > Erick
>> >
>> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
>> > 
>> > wrote:
>> > > I completely agree on the impulse, and for the vast majority of 
>> > > the time
>> > (regular catchable exceptions), that'll work.  And, by vast 
>> > majority, aside from oom on very large files, we aren't seeing 
>> > these problems any more in our 3 million doc corpus (y, I know, 
>> > small by today's
>> > standards) from
>> > govdocs1 and Common Crawl over on our Rackspace vm.
>> > >
>> > > Given my focus on Tika, I'm overly sensitive to the worst case
>> > scenarios.  I find it encouraging, Erick, that you haven't seen 
>> > these types of problems, that users aren't complaining too often 
>> > about catastrophic failures of Tika within Solr Cell, and that this 
>> > thread is not yet swamped with integrators agreeing with me. :)
>> > >
>> > > However, because oom can leave memory in a corrupted state 
>> > > (right?),
>> > because you can't actually kill a thread for a permanent hang and 
>> > because Tika is a kitchen sink and we can't prevent memory leaks in 
>> > our dependencies, one needs to be aware that bad things can 
>> > happen...if only very, very rarely.  For a fel

Re: edismax query parser - pf field question

2016-02-11 Thread Erick Erickson
Try comma instead of space delimiting?

On Thu, Feb 11, 2016 at 2:33 PM, Senthil  wrote:
> Clarification needed on edismax query parser "pf" field.
>
> *SOLR Query:*
> /query?q=refrigerator water filter&qf=P_NAME^1.5
> CategoryName&wt=xml&debugQuery=on&pf=P_NAME
> CategoryName&mm=2&fl=CategoryName P_NAME score&defType=edismax
>
> *Parsed Query from DebugQuery results:*
> (+((DisjunctionMaxQuery((P_NAME:refriger^1.5 |
> CategoryName:refrigerator)) DisjunctionMaxQuery((P_NAME:water^1.5 |
> CategoryName:water)) DisjunctionMaxQuery((P_NAME:filter^1.5 |
> CategoryName:filter)))~2) DisjunctionMaxQuery((P_NAME:"refriger water
> filter")))/no_coord
>
> In the SOLR query given above, I am asking for phrase matches on 2 fields:
> P_NAME and CategoryName.
> But If you notice ParsedQuery, I see Phrase match is applied only on P_NAME
> field but not on CategoryName field. Why?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/edismax-query-parser-pf-field-question-tp4256845.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How is Tika used with Solr

2016-02-11 Thread Erick Erickson
Well, I'd imagine you could spawn threads and monitor/kill them as
necessary, although that doesn't deal with OOM errors

FWIW,
Erick

On Thu, Feb 11, 2016 at 3:08 PM, xavi jmlucjav  wrote:
> For sure, if I need heavy duty text extraction again, Tika would be the
> obvious choice if it covers dealing with hangs. I never used tika-server
> myself (not sure if it existed at the time) just used tika from my own jvm.
>
> On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. 
> wrote:
>
>> x-post to Tika user's
>>
>> Y and n.  If you run tika app as:
>>
>> java -jar tika-app.jar <input_directory> <output_directory>
>>
>> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This
>> creates a parent and child process, if the child process notices a hung
>> thread, it dies, and the parent restarts it.  Or if your OS gets upset with
>> the child process and kills it out of self preservation, the parent
>> restarts the child, or if there's an OOM...and you can configure how often
>> the child shuts itself down (with parental restarting) to mitigate memory
>> leaks.
>>
>> So, y, if your use case allows <input directory -> output directory> processing, then we now have
>> that in Tika.
>>
>> I've been wanting to add a similar watchdog to tika-server ... any
>> interest in that?
>>
>>
>> -Original Message-
>> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
>> Sent: Thursday, February 11, 2016 2:16 PM
>> To: solr-user 
>> Subject: Re: How is Tika used with Solr
>>
>> I have found that when you deal with large amounts of all sort of files,
>> in the end you find stuff (pdfs are typically nasty) that will hang tika.
>> That is even worse than a crash or OOM.
>> We used aperture instead of tika because at the time it provided a
>> watchdog feature to kill what seemed like a hanged extracting thread. That
>> feature is super important for a robust text extracting pipeline. Has Tika
>> gained such feature already?
>>
>> xavier
>>
>> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
>> wrote:
>>
>> > Timothy's points are absolutely spot-on. In production scenarios, if
>> > you use the simple "run Tika in a SolrJ program" approach you _must_
>> > abort the program on OOM errors and the like and  figure out what's
>> > going on with the offending document(s). Or record the name somewhere
>> > and skip it next time 'round. Or
>> >
>> > How much you have to build in here really depends on your use case.
>> > For "small enough"
>> > sets of documents or one-time indexing, you can get by with dealing
>> > with errors one at a time.
>> > For robust systems where you have to have indexing available at all
>> > times and _especially_ where you don't control the document corpus,
>> > you have to build something far more tolerant as per Tim's comments.
>> >
>> > FWIW,
>> > Erick
>> >
>> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
>> > 
>> > wrote:
>> > > I completely agree on the impulse, and for the vast majority of the
>> > > time
>> > (regular catchable exceptions), that'll work.  And, by vast majority,
>> > aside from oom on very large files, we aren't seeing these problems
>> > any more in our 3 million doc corpus (y, I know, small by today's
>> > standards) from
>> > govdocs1 and Common Crawl over on our Rackspace vm.
>> > >
>> > > Given my focus on Tika, I'm overly sensitive to the worst case
>> > scenarios.  I find it encouraging, Erick, that you haven't seen these
>> > types of problems, that users aren't complaining too often about
>> > catastrophic failures of Tika within Solr Cell, and that this thread
>> > is not yet swamped with integrators agreeing with me. :)
>> > >
>> > > However, because oom can leave memory in a corrupted state (right?),
>> > because you can't actually kill a thread for a permanent hang and
>> > because Tika is a kitchen sink and we can't prevent memory leaks in
>> > our dependencies, one needs to be aware that bad things can
>> > happen...if only very, very rarely.  For a fellow traveler who has run
>> > into these issues on massive data sets, see also [0].
>> > >
>> > > Configuring Hadoop to work around these types of problems is not too
>> > difficult -- it has to be done with some thought, though.  On
>> > conventional single box setups, the ForkParser within Tika is one
>> > option, tika-batch is another.  Hand rolling your own parent/child
>> > process is non-trivial and is not necessary for the vast majority of use
>> cases.
>> > >
>> > >
>> > > [0]
>> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
>> > eb-content-nanite/
>> > >
>> > >
>> > >
>> > > -Original Message-
>> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
>> > > Sent: Tuesday, February 09, 2016 10:05 PM
>> > > To: solr-user 
>> > > Subject: Re: How is Tika used with Solr
>> > >
>> > > My impulse would be to _not_ run Tika in its own JVM, just catch any
>> > exceptions in my code and "do the right thing". I'm not sure I see any
>> > real benefit in yet another JVM.
>> > >
>> > > FWIW,
>> > > Erick
>> > >
>> > > On Tue, Feb 9,

Re: SolrCloud shard marked as down and "reloading" collection doesnt restore it

2016-02-11 Thread KNitin
After  more debugging, I figured out that it is related to this:
https://issues.apache.org/jira/browse/SOLR-3274

Is there a recommended fix (apart from running a zk ensemble?)

On Thu, Feb 11, 2016 at 10:29 AM, KNitin  wrote:

> Hi,
>
>  I noticed while running an indexing job (2M docs but per doc size could
> be 2-3 MB) that one of the shards goes down just after the commit.  (Not
> related to OOM or high cpu/load).  This marks the shard as "down" in zk and
> even a reload of the collection does not recover the state.
>
> There are no exceptions in the logs and the stack trace indicates jetty
> threads in blocked state.
>
> The last few lines in the logs are as follows:
>
> trib=TOLEADER&wt=javabin&version=2} {add=[1552605 (1525453861590925312)]}
> 0 5
> INFO  - 2016-02-06 19:17:47.658;
> org.apache.solr.update.DirectUpdateHandler2; start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> INFO  - 2016-02-06 19:18:02.209; org.apache.solr.core.SolrDeletionPolicy;
> SolrDeletionPolicy.onCommit: commits: num=2
> INFO  - 2016-02-06 19:18:02.209; org.apache.solr.core.SolrDeletionPolicy;
> newest commit generation = 6
> INFO  - 2016-02-06 19:18:02.233; org.apache.solr.search.SolrIndexSearcher;
> Opening Searcher@321a0cc9 main
> INFO  - 2016-02-06 19:18:02.296; org.apache.solr.core.QuerySenderListener;
> QuerySenderListener sending requests to Searcher@321a0cc9
> main{StandardDirectoryReader(segments_6:180:nrt
> _20(4.6):C15155/216:delGen=1 _w(4.6):C1538/63:delGen=2
> _16(4.6):C279/20:delGen=2 _e(4.6):C11386/514:delGen=3
> _g(4.6):C4434/204:delGen=3 _p(4.6):C418/5:delGen=1 _v(4.6):C1
> _x(4.6):C17583/316:delGen=2 _y(4.6):C9783/112:delGen=2
> _z(4.6):C4736/47:delGen=2 _12(4.6):C705/2:delGen=1 _13(4.6):C275/4:delGen=1
> _1b(4.6):C619 _26(4.6):C318/13:delGen=1 _1e(4.6):C25356/763:delGen=3
> _1f(4.6):C13024/426:delGen=2 _1g(4.6):C5368/142:delGen=2
> _1j(4.6):C499/16:delGen=2 _1m(4.6):C448/23:delGen=2
> _1p(4.6):C236/17:delGen=2 _1k(4.6):C173/5:delGen=1
> _1s(4.6):C1082/78:delGen=2 _1t(4.6):C195/17:delGen=2 _1u(4.6):C2
> _21(4.6):C16494/1278:delGen=1 _22(4.6):C5193/398:delGen=1
> _23(4.6):C1361/102:delGen=1 _24(4.6):C475/36:delGen=1
> _29(4.6):C126/11:delGen=1 _2d(4.6):C97/3:delGen=1 _27(4.6):C59/7:delGen=1
> _28(4.6):C26/6:delGen=1 _2b(4.6):C40 _25(4.6):C39/1:delGen=1
> _2c(4.6):C139/9:delGen=1 _2a(4.6):C26/6:delGen=1)}
>
>
> The only solution is to restart the cluster. Why does a reload not work
> and is this a known bug (for which there is a patch i can apply)?
>
> Any pointers are much appreciated
>
> Thanks!
> Nitin
>


Re: Select distinct records

2016-02-11 Thread Brian Narsi
In order to use the Collapsing feature I will need to use Document Routing
to co-locate related documents in the same shard in SolrCloud. What are the
advantages and disadvantages of Document Routing?

Thanks,
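
For context, the co-location in question comes from compositeId routing:
documents whose ids share the same prefix before the "!" hash to the same
shard, which is what the CollapsingQParserPlugin needs when collapsing on that
field. A sketch with made-up ids, collection name and field name:

    # two documents that must land on the same shard share the "acme!" prefix
    curl "http://localhost:8983/solr/mycollection/update?commit=true" \
         -H "Content-Type: application/json" \
         -d '[{"id":"acme!1","name_s":"acme"},{"id":"acme!2","name_s":"acme"}]'

    # collapse on name_s so only one document per distinct name is returned
    curl "http://localhost:8983/solr/mycollection/select" \
         --data-urlencode "q=*:*" \
         --data-urlencode "fq={!collapse field=name_s}" \
         --data-urlencode "fl=id,name_s"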

On Thu, Feb 11, 2016 at 12:54 PM, Joel Bernstein  wrote:

> Yeah that would be the reason. If you want distributed unique capabilities,
> then you might want to start testing out 6.0. Aside from SELECT DISTINCT
> queries, you also have a much more mature Streaming Expression library
> which supports the unique operation.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Feb 11, 2016 at 12:28 PM, Brian Narsi  wrote:
>
> > Ok I see that Collapsing features requires documents to be co-located in
> > the same shard in SolrCloud.
> >
> > Could that be a reason for duplication?
> >
> > On Thu, Feb 11, 2016 at 11:09 AM, Joel Bernstein 
> > wrote:
> >
> > > The CollapsingQParserPlugin shouldn't have duplicates in the result
> set.
> > > Can you provide the details?
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Thu, Feb 11, 2016 at 12:02 PM, Brian Narsi 
> > wrote:
> > >
> > > > I have tried to use the Collapsing feature but it appears that it
> > leaves
> > > > duplicated records in the result set.
> > > >
> > > > Is that expected? Or any suggestions on working around it?
> > > >
> > > > Thanks
> > > >
> > > > On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi 
> > wrote:
> > > >
> > > > > I am using
> > > > >
> > > > > Solr 5.1.0
> > > > >
> > > > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal <
> binoydala...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > >> What version of Solr are you using?
> > > > >> Have you taken a look at the Collapsing Query Parser. It basically
> > > > >> performs
> > > > >> the same functions as grouping but is much more efficient at doing
> > it.
> > > > >> Take a look here:
> > > > >>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> > > > >>
> > > > >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi 
> > > wrote:
> > > > >>
> > > > >> > I am trying to select distinct records from a collection. (I
> need
> > > > >> distinct
> > > > >> > name and corresponding id)
> > > > >> >
> > > > >> > I have tried using grouping and group format of simple but that
> > > takes
> > > > a
> > > > >> > long time to execute and sometimes runs into out of memory
> > > exception.
> > > > >> > Another limitation seems to be that total number of groups are
> not
> > > > >> > returned.
> > > > >> >
> > > > >> > Is there another faster and more efficient way to do this?
> > > > >> >
> > > > >> > Thank you
> > > > >> >
> > > > >> --
> > > > >> Regards,
> > > > >> Binoy Dalal
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>


Re: How is Tika used with Solr

2016-02-11 Thread xavi jmlucjav
For sure, if I need heavy duty text extraction again, Tika would be the
obvious choice if it covers dealing with hangs. I never used tika-server
myself (not sure if it existed at the time) just used tika from my own jvm.

On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. 
wrote:

> x-post to Tika user's
>
> Y and n.  If you run tika app as:
>
> java -jar tika-app.jar <input_directory> <output_directory>
>
> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This
> creates a parent and child process, if the child process notices a hung
> thread, it dies, and the parent restarts it.  Or if your OS gets upset with
> the child process and kills it out of self preservation, the parent
> restarts the child, or if there's an OOM...and you can configure how often
> the child shuts itself down (with parental restarting) to mitigate memory
> leaks.
>
> So, y, if your use case allows <input directory -> output directory> processing, then we now have
> that in Tika.
>
> I've been wanting to add a similar watchdog to tika-server ... any
> interest in that?
>
>
> -Original Message-
> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
> Sent: Thursday, February 11, 2016 2:16 PM
> To: solr-user 
> Subject: Re: How is Tika used with Solr
>
> I have found that when you deal with large amounts of all sort of files,
> in the end you find stuff (pdfs are typically nasty) that will hang tika.
> That is even worse than a crash or OOM.
> We used aperture instead of tika because at the time it provided a
> watchdog feature to kill what seemed like a hanged extracting thread. That
> feature is super important for a robust text extracting pipeline. Has Tika
> gained such feature already?
>
> xavier
>
> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
> wrote:
>
> > Timothy's points are absolutely spot-on. In production scenarios, if
> > you use the simple "run Tika in a SolrJ program" approach you _must_
> > abort the program on OOM errors and the like and  figure out what's
> > going on with the offending document(s). Or record the name somewhere
> > and skip it next time 'round. Or
> >
> > How much you have to build in here really depends on your use case.
> > For "small enough"
> > sets of documents or one-time indexing, you can get by with dealing
> > with errors one at a time.
> > For robust systems where you have to have indexing available at all
> > times and _especially_ where you don't control the document corpus,
> > you have to build something far more tolerant as per Tim's comments.
> >
> > FWIW,
> > Erick
> >
> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> > 
> > wrote:
> > > I completely agree on the impulse, and for the vast majority of the
> > > time
> > (regular catchable exceptions), that'll work.  And, by vast majority,
> > aside from oom on very large files, we aren't seeing these problems
> > any more in our 3 million doc corpus (y, I know, small by today's
> > standards) from
> > govdocs1 and Common Crawl over on our Rackspace vm.
> > >
> > > Given my focus on Tika, I'm overly sensitive to the worst case
> > scenarios.  I find it encouraging, Erick, that you haven't seen these
> > types of problems, that users aren't complaining too often about
> > catastrophic failures of Tika within Solr Cell, and that this thread
> > is not yet swamped with integrators agreeing with me. :)
> > >
> > > However, because oom can leave memory in a corrupted state (right?),
> > because you can't actually kill a thread for a permanent hang and
> > because Tika is a kitchen sink and we can't prevent memory leaks in
> > our dependencies, one needs to be aware that bad things can
> > happen...if only very, very rarely.  For a fellow traveler who has run
> > into these issues on massive data sets, see also [0].
> > >
> > > Configuring Hadoop to work around these types of problems is not too
> > difficult -- it has to be done with some thought, though.  On
> > conventional single box setups, the ForkParser within Tika is one
> > option, tika-batch is another.  Hand rolling your own parent/child
> > process is non-trivial and is not necessary for the vast majority of use
> cases.
> > >
> > >
> > > [0]
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> > eb-content-nanite/
> > >
> > >
> > >
> > > -Original Message-
> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > Sent: Tuesday, February 09, 2016 10:05 PM
> > > To: solr-user 
> > > Subject: Re: How is Tika used with Solr
> > >
> > > My impulse would be to _not_ run Tika in its own JVM, just catch any
> > exceptions in my code and "do the right thing". I'm not sure I see any
> > real benefit in yet another JVM.
> > >
> > > FWIW,
> > > Erick
> > >
> > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B.
> > > 
> > wrote:
> > >> I have one answer here [0], but I'd be interested to hear what Solr
> > users/devs/integrators have experienced on this topic.
> > >>
> > >> [0]
> > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CC
> > >> Y1P
> > >> R09MB0795EAED9

Re: How is Tika used with Solr

2016-02-11 Thread Steven White
Tim,

In my case, I have to use Tika as follows:

java -jar tika-app.jar -t <file>

I will be invoking the above command from my Java app
using Runtime.getRuntime().exec().  I will capture stdout and stderr to get
back the raw text I need.  My app's use case will not allow an
<input directory -> output directory> setup; it is out of the question.

Reading your summary, it looks like I won't get this watch-dog monitoring
and thus I have to implement my own.  Can you confirm?

Thanks

Steve
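
If you do shell out per file as described, a coarse watchdog can sit entirely
outside the JVM. A sketch assuming GNU coreutils' timeout is available and a
120-second per-file budget (the file names are placeholders):

    # SIGTERM after 120s, SIGKILL 10s later if the process ignores it; exit code 124 = timed out
    timeout -k 10 120 java -jar tika-app.jar -t /path/to/input.pdf > input.txt 2> tika.err
    if [ $? -eq 124 ]; then
        echo "/path/to/input.pdf" >> hung-files.log
    fi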


On Thu, Feb 11, 2016 at 2:45 PM, Allison, Timothy B. 
wrote:

> x-post to Tika user's
>
> Y and n.  If you run tika app as:
>
> java -jar tika-app.jar <input_directory> <output_directory>
>
> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This
> creates a parent and child process, if the child process notices a hung
> thread, it dies, and the parent restarts it.  Or if your OS gets upset with
> the child process and kills it out of self preservation, the parent
> restarts the child, or if there's an OOM...and you can configure how often
> the child shuts itself down (with parental restarting) to mitigate memory
> leaks.
>
> So, y, if your use case allows <input directory -> output directory> processing, then we now have
> that in Tika.
>
> I've been wanting to add a similar watchdog to tika-server ... any
> interest in that?
>
>
> -Original Message-
> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
> Sent: Thursday, February 11, 2016 2:16 PM
> To: solr-user 
> Subject: Re: How is Tika used with Solr
>
> I have found that when you deal with large amounts of all sort of files,
> in the end you find stuff (pdfs are typically nasty) that will hang tika.
> That is even worse than a crash or OOM.
> We used aperture instead of tika because at the time it provided a
> watchdog feature to kill what seemed like a hanged extracting thread. That
> feature is super important for a robust text extracting pipeline. Has Tika
> gained such feature already?
>
> xavier
>
> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
> wrote:
>
> > Timothy's points are absolutely spot-on. In production scenarios, if
> > you use the simple "run Tika in a SolrJ program" approach you _must_
> > abort the program on OOM errors and the like and  figure out what's
> > going on with the offending document(s). Or record the name somewhere
> > and skip it next time 'round. Or
> >
> > How much you have to build in here really depends on your use case.
> > For "small enough"
> > sets of documents or one-time indexing, you can get by with dealing
> > with errors one at a time.
> > For robust systems where you have to have indexing available at all
> > times and _especially_ where you don't control the document corpus,
> > you have to build something far more tolerant as per Tim's comments.
> >
> > FWIW,
> > Erick
> >
> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> > 
> > wrote:
> > > I completely agree on the impulse, and for the vast majority of the
> > > time
> > (regular catchable exceptions), that'll work.  And, by vast majority,
> > aside from oom on very large files, we aren't seeing these problems
> > any more in our 3 million doc corpus (y, I know, small by today's
> > standards) from
> > govdocs1 and Common Crawl over on our Rackspace vm.
> > >
> > > Given my focus on Tika, I'm overly sensitive to the worst case
> > scenarios.  I find it encouraging, Erick, that you haven't seen these
> > types of problems, that users aren't complaining too often about
> > catastrophic failures of Tika within Solr Cell, and that this thread
> > is not yet swamped with integrators agreeing with me. :)
> > >
> > > However, because oom can leave memory in a corrupted state (right?),
> > because you can't actually kill a thread for a permanent hang and
> > because Tika is a kitchen sink and we can't prevent memory leaks in
> > our dependencies, one needs to be aware that bad things can
> > happen...if only very, very rarely.  For a fellow traveler who has run
> > into these issues on massive data sets, see also [0].
> > >
> > > Configuring Hadoop to work around these types of problems is not too
> > difficult -- it has to be done with some thought, though.  On
> > conventional single box setups, the ForkParser within Tika is one
> > option, tika-batch is another.  Hand rolling your own parent/child
> > process is non-trivial and is not necessary for the vast majority of use
> cases.
> > >
> > >
> > > [0]
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> > eb-content-nanite/
> > >
> > >
> > >
> > > -Original Message-
> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > Sent: Tuesday, February 09, 2016 10:05 PM
> > > To: solr-user 
> > > Subject: Re: How is Tika used with Solr
> > >
> > > My impulse would be to _not_ run Tika in its own JVM, just catch any
> > exceptions in my code and "do the right thing". I'm not sure I see any
> > real benefit in yet another JVM.
> > >
> > > FWIW,
> > > Erick
> > >
> > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B.
> > > 
> > wrote:
> > >> I have one answ

edismax query parser - pf field question

2016-02-11 Thread Senthil
Clarification needed on edismax query parser "pf" field.

*SOLR Query:*
/query?q=refrigerator water filter&qf=P_NAME^1.5
CategoryName&wt=xml&debugQuery=on&pf=P_NAME
CategoryName&mm=2&fl=CategoryName P_NAME score&defType=edismax

*Parsed Query from DebugQuery results:*
(+((DisjunctionMaxQuery((P_NAME:refriger^1.5 |
CategoryName:refrigerator)) DisjunctionMaxQuery((P_NAME:water^1.5 |
CategoryName:water)) DisjunctionMaxQuery((P_NAME:filter^1.5 |
CategoryName:filter)))~2) DisjunctionMaxQuery((P_NAME:"refriger water
filter")))/no_coord

In the SOLR query given above, I am asking for phrase matches on 2 fields:
P_NAME and CategoryName.
But if you look at the parsed query, the phrase match is applied only to the P_NAME
field and not to the CategoryName field. Why?
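
As an aside, the query above contains unencoded spaces, so when experimenting it
is easier to let curl do the encoding; a sketch with a placeholder collection name:

    curl "http://localhost:8983/solr/mycollection/query" \
         --data-urlencode "q=refrigerator water filter" \
         --data-urlencode "defType=edismax" \
         --data-urlencode "qf=P_NAME^1.5 CategoryName" \
         --data-urlencode "pf=P_NAME CategoryName" \
         --data-urlencode "mm=2" \
         --data-urlencode "fl=CategoryName P_NAME score" \
         --data-urlencode "debugQuery=on" \
         --data-urlencode "wt=xml"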



--
View this message in context: 
http://lucene.472066.n3.nabble.com/edismax-query-parser-pf-field-question-tp4256845.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: slave is getting full synced every polling

2016-02-11 Thread Erick Erickson
Typo? That's 60 seconds, but that's not especially interesting either way.

Do the actual segments look identical after the polling?
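
One way to check without diffing index directories is the replication handler's
own reporting; a sketch using the core name from this thread (the host names are
placeholders):

    curl "http://master:8983/solr/big_core/replication?command=indexversion&wt=json"
    curl "http://slave:8983/solr/big_core/replication?command=details&wt=json"
    # if indexVersion/generation match on both sides after a poll, no full copy happened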

On Thu, Feb 11, 2016 at 1:16 PM, Novin Novin  wrote:
> Hi Erick,
>
> Below is master slave config:
>
> Master:
> 
>  
> commit
> optimize
> 
> 2
>   
>
> Slave:
> 
> 
>   
>   http://master:8983/solr/big_core/replication
> 
>   00:00:60
>   username
>   password
>  
>   
>
>
> Do you mean the Solr is restarting every minute or the polling
> interval is 60 seconds?
>
> I meant polling is 60 minutes
>
> I didn't not see any suspicious in logs , and I'm not optimizing any thing
> with commit.
>
> Thanks
> Novin
>
> On Thu, 11 Feb 2016 at 18:02 Erick Erickson  wrote:
>
>> What is your replication configuration in solrconfig.xml on both
>> master and slave?
>>
>> bq:  big core is doing full sync every time wherever it start (every
>> minute).
>>
>> Do you mean the Solr is restarting every minute or the polling
>> interval is 60 seconds?
>>
>> The Solr logs should tell you something about what's going on there.
>> Also, if you are for
>> some reason optimizing the index that'll cause a full replication.
>>
>> Best,
>> Erick
>>
>> On Thu, Feb 11, 2016 at 8:41 AM, Novin Novin  wrote:
>> > Hi Guys,
>> >
>> > I'm having a problem with master slave syncing.
>> >
>> > So I have two cores one is small core (just keep data use frequently for
>> > fast results) and another is big core (for rare query and for search in
>> > every thing). both core has same solrconfig file. But small core
>> > replication is fine, other than this big core is doing full sync every
>> time
>> > wherever it start (every minute).
>> >
>> > I found this
>> >
>> http://stackoverflow.com/questions/6435652/solr-replication-keeps-downloading-entire-index-from-master
>> >
>> > But not really usefull.
>> >
>> > Solr verion 5.2.0
>> > Small core has doc 10 mil. size around 10 to 15 GB.
>> > Big core has doc greater than 100 mil. size around 25 to 35 GB.
>> >
>> > How can I stop full sync.
>> >
>> > Thanks
>> > Novin
>>


Re: slave is getting full synced every polling

2016-02-11 Thread Novin Novin
Hi Erick,

Below is master slave config:

Master:

 
commit
optimize

2
  

Slave:


  
  http://master:8983/solr/big_core/replication

  00:00:60
  username
  password
 
  


Do you mean the Solr is restarting every minute or the polling
interval is 60 seconds?

I meant the polling interval is 60 minutes.

I didn't see anything suspicious in the logs, and I'm not optimizing anything
with the commit.

Thanks
Novin

On Thu, 11 Feb 2016 at 18:02 Erick Erickson  wrote:

> What is your replication configuration in solrconfig.xml on both
> master and slave?
>
> bq:  big core is doing full sync every time wherever it start (every
> minute).
>
> Do you mean the Solr is restarting every minute or the polling
> interval is 60 seconds?
>
> The Solr logs should tell you something about what's going on there.
> Also, if you are for
> some reason optimizing the index that'll cause a full replication.
>
> Best,
> Erick
>
> On Thu, Feb 11, 2016 at 8:41 AM, Novin Novin  wrote:
> > Hi Guys,
> >
> > I'm having a problem with master slave syncing.
> >
> > So I have two cores one is small core (just keep data use frequently for
> > fast results) and another is big core (for rare query and for search in
> > every thing). both core has same solrconfig file. But small core
> > replication is fine, other than this big core is doing full sync every
> time
> > wherever it start (every minute).
> >
> > I found this
> >
> http://stackoverflow.com/questions/6435652/solr-replication-keeps-downloading-entire-index-from-master
> >
> > But not really usefull.
> >
> > Solr verion 5.2.0
> > Small core has doc 10 mil. size around 10 to 15 GB.
> > Big core has doc greater than 100 mil. size around 25 to 35 GB.
> >
> > How can I stop full sync.
> >
> > Thanks
> > Novin
>


RE: outlook email file pst extraction problem

2016-02-11 Thread Allison, Timothy B.
Should have looked at how we handle psts before my earlier responses...sorry.

What you're seeing is Tika's default treatment of embedded documents: it 
concatenates them all into one string.  It'll do the same thing for zip files 
and other container files.  The default Tika format is xhtml, and we include 
tags that show you where the attachments are.  If the tags are stripped, then 
you only get a big blob of text, which is often all that's necessary for search.

Before SOLR-7189, you wouldn't have gotten any content, so that's 
progress...right?

Some options for now:
1) use java-libpst as a preprocessing step to extract contents from your psts 
before you ingest them in Solr (feel free to borrow code from our 
OutlookPSTParser).
2) use tika from the commandline with the -J -t options to get a Json 
representation of the overall file, which includes a list of maps, where each 
map represents a single embedded file.  Again, if you have any questions on 
this, head over to u...@tika.apache.org

I think what you want is something along the lines of SOLR-7229, which would 
treat each embedded document as its own document.  That issue is not resolved, 
and there's currently no way of doing this within DIH that I'm aware of.

If others on this list have an interest in SOLR-7229, let me know, and I'll try 
to find some time.  I'd need feedback on some design decisions.
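
A sketch of option 2 above (the -J -t commandline), using the pst file name from
the original message; each map in the resulting JSON array is one embedded
message with its own metadata and extracted text:

    java -jar tika-app.jar -J -t /home/ec2-user/sateamc_0006.pst > sateamc_0006.json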





-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com] 
Sent: Thursday, February 11, 2016 1:43 PM
To: solr-user@lucene.apache.org
Subject: outlook email file pst extraction problem

Hi ,
   I am currently indexing individual outlook messages and searching is 
working fine.
I have created solr core using following command.
 ./solr create -c sreenimsg1 -d data_driven_schema_configs

I am using following command to index individual messages.
curl  "
http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true";
-F "myfile=@/home/ec2-user/msg9.msg"

This setup is working fine.

But new requirement is extract messages using outlook pst file.
I tried following command to extract messages from outlook pst file.

curl  "
http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true";
-F "myfile=@/home/ec2-user/sateamc_0006.pst"

This command extracts only high-level tags and all of the messages end up in 
one document. I am not getting all the tags I get when extracting individual 
messages. Is the above command correct? Is the problem that it does not use 
recursion? How do I add recursion to the above command? Is it a Tika library 
problem?

Please help to solve above problem.

Advanced Thanks.

--sreenivasa kallu


RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
x-post to Tika user's

Y and n.  If you run tika app as: 

java -jar tika-app.jar <input_directory> <output_directory>

It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This 
creates a parent and child process, if the child process notices a hung thread, 
it dies, and the parent restarts it.  Or if your OS gets upset with the child 
process and kills it out of self preservation, the parent restarts the child, 
or if there's an OOM...and you can configure how often the child shuts itself 
down (with parental restarting) to mitigate memory leaks.

So, y, if your use case allows <input directory -> output directory> processing, then we now have that 
in Tika.

I've been wanting to add a similar watchdog to tika-server ... any interest in 
that?


-Original Message-
From: xavi jmlucjav [mailto:jmluc...@gmail.com] 
Sent: Thursday, February 11, 2016 2:16 PM
To: solr-user 
Subject: Re: How is Tika used with Solr

I have found that when you deal with large amounts of all sort of files, in the 
end you find stuff (pdfs are typically nasty) that will hang tika. That is even 
worse than a crash or OOM.
We used aperture instead of tika because at the time it provided a watchdog 
feature to kill what seemed like a hanged extracting thread. That feature is 
super important for a robust text extracting pipeline. Has Tika gained such 
feature already?

xavier

On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
wrote:

> Timothy's points are absolutely spot-on. In production scenarios, if 
> you use the simple "run Tika in a SolrJ program" approach you _must_ 
> abort the program on OOM errors and the like and  figure out what's 
> going on with the offending document(s). Or record the name somewhere 
> and skip it next time 'round. Or
>
> How much you have to build in here really depends on your use case.
> For "small enough"
> sets of documents or one-time indexing, you can get by with dealing 
> with errors one at a time.
> For robust systems where you have to have indexing available at all 
> times and _especially_ where you don't control the document corpus, 
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. 
> 
> wrote:
> > I completely agree on the impulse, and for the vast majority of the 
> > time
> (regular catchable exceptions), that'll work.  And, by vast majority, 
> aside from oom on very large files, we aren't seeing these problems 
> any more in our 3 million doc corpus (y, I know, small by today's 
> standards) from
> govdocs1 and Common Crawl over on our Rackspace vm.
> >
> > Given my focus on Tika, I'm overly sensitive to the worst case
> scenarios.  I find it encouraging, Erick, that you haven't seen these 
> types of problems, that users aren't complaining too often about 
> catastrophic failures of Tika within Solr Cell, and that this thread 
> is not yet swamped with integrators agreeing with me. :)
> >
> > However, because oom can leave memory in a corrupted state (right?),
> because you can't actually kill a thread for a permanent hang and 
> because Tika is a kitchen sink and we can't prevent memory leaks in 
> our dependencies, one needs to be aware that bad things can 
> happen...if only very, very rarely.  For a fellow traveler who has run 
> into these issues on massive data sets, see also [0].
> >
> > Configuring Hadoop to work around these types of problems is not too
> difficult -- it has to be done with some thought, though.  On 
> conventional single box setups, the ForkParser within Tika is one 
> option, tika-batch is another.  Hand rolling your own parent/child 
> process is non-trivial and is not necessary for the vast majority of use 
> cases.
> >
> >
> > [0]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> eb-content-nanite/
> >
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Tuesday, February 09, 2016 10:05 PM
> > To: solr-user 
> > Subject: Re: How is Tika used with Solr
> >
> > My impulse would be to _not_ run Tika in its own JVM, just catch any
> exceptions in my code and "do the right thing". I'm not sure I see any 
> real benefit in yet another JVM.
> >
> > FWIW,
> > Erick
> >
> > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. 
> > 
> wrote:
> >> I have one answer here [0], but I'd be interested to hear what Solr
> users/devs/integrators have experienced on this topic.
> >>
> >> [0]
> >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CC
> >> Y1P 
> >> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.ou
> >> tlo
> >> ok.com%3E
> >>
> >> -Original Message-
> >> From: Steven White [mailto:swhite4...@gmail.com]
> >> Sent: Tuesday, February 09, 2016 6:33 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: How is Tika used with Solr
> >>
> >> Thank you Erick and Alex.
> >>
> >> My main question is with a long running process using Tika in the 
> >> same
> JVM as my application.  I'm running my file-s

Re: How is Tika used with Solr

2016-02-11 Thread xavi jmlucjav
I have found that when you deal with large amounts of all sort of files, in
the end you find stuff (pdfs are typically nasty) that will hang tika. That
is even worse than a crash or OOM.
We used aperture instead of tika because at the time it provided a watchdog
feature to kill what seemed like a hanged extracting thread. That feature
is super important for a robust text extracting pipeline. Has Tika gained
such feature already?

xavier

On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
wrote:

> Timothy's points are absolutely spot-on. In production scenarios, if
> you use the simple
> "run Tika in a SolrJ program" approach you _must_ abort the program on
> OOM errors
> and the like and  figure out what's going on with the offending
> document(s). Or record the
> name somewhere and skip it next time 'round. Or
>
> How much you have to build in here really depends on your use case.
> For "small enough"
> sets of documents or one-time indexing, you can get by with dealing
> with errors one at a time.
> For robust systems where you have to have indexing available at all
> times and _especially_
> where you don't control the document corpus, you have to build
> something far more
> tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. 
> wrote:
> > I completely agree on the impulse, and for the vast majority of the time
> (regular catchable exceptions), that'll work.  And, by vast majority, aside
> from oom on very large files, we aren't seeing these problems any more in
> our 3 million doc corpus (y, I know, small by today's standards) from
> govdocs1 and Common Crawl over on our Rackspace vm.
> >
> > Given my focus on Tika, I'm overly sensitive to the worst case
> scenarios.  I find it encouraging, Erick, that you haven't seen these types
> of problems, that users aren't complaining too often about catastrophic
> failures of Tika within Solr Cell, and that this thread is not yet swamped
> with integrators agreeing with me. :)
> >
> > However, because oom can leave memory in a corrupted state (right?),
> because you can't actually kill a thread for a permanent hang and because
> Tika is a kitchen sink and we can't prevent memory leaks in our
> dependencies, one needs to be aware that bad things can happen...if only
> very, very rarely.  For a fellow traveler who has run into these issues on
> massive data sets, see also [0].
> >
> > Configuring Hadoop to work around these types of problems is not too
> difficult -- it has to be done with some thought, though.  On conventional
> single box setups, the ForkParser within Tika is one option, tika-batch is
> another.  Hand rolling your own parent/child process is non-trivial and is
> not necessary for the vast majority of use cases.
> >
> >
> > [0]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> >
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Tuesday, February 09, 2016 10:05 PM
> > To: solr-user 
> > Subject: Re: How is Tika used with Solr
> >
> > My impulse would be to _not_ run Tika in its own JVM, just catch any
> exceptions in my code and "do the right thing". I'm not sure I see any real
> benefit in yet another JVM.
> >
> > FWIW,
> > Erick
> >
> > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. 
> wrote:
> >> I have one answer here [0], but I'd be interested to hear what Solr
> users/devs/integrators have experienced on this topic.
> >>
> >> [0]
> >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1P
> >> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlo
> >> ok.com%3E
> >>
> >> -Original Message-
> >> From: Steven White [mailto:swhite4...@gmail.com]
> >> Sent: Tuesday, February 09, 2016 6:33 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: How is Tika used with Solr
> >>
> >> Thank you Erick and Alex.
> >>
> >> My main question is with a long running process using Tika in the same
> JVM as my application.  I'm running my file-system-crawler in its own JVM
> (not Solr's).  On Tika mailing list, it is suggested to run Tika's code in
> it's own JVM and invoke it from my file-system-crawler using
> Runtime.getRuntime().exec().
> >>
> >> I fully understand from Alex suggestion and link provided by Erick to
> use Tika outside Solr.  But what about using Tika within the same JVM as my
> file-system-crawler application or should I be making a system call to
> invoke another JAR, that runs in its own JVM to extract the raw text?  Are
> there known issues with Tika when used in a long running process?
> >>
> >> Steve
> >>
> >>
>


RE: outlook email file pst extraction problem

2016-02-11 Thread Allison, Timothy B.
Y, this looks like a Tika feature.  If you run the tika-app.jar [1] on your file 
and you get the same output, then that's Tika's doing.

Drop a note on the u...@tika.apache.org list if Tika isn't meeting your needs.

-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com] 
Sent: Thursday, February 11, 2016 1:43 PM
To: solr-user@lucene.apache.org
Subject: outlook email file pst extraction problem

Hi ,
   I am currently indexing individual outlook messages and searching is 
working fine.
I have created solr core using following command.
 ./solr create -c sreenimsg1 -d data_driven_schema_configs

I am using following command to index individual messages.
curl  "
http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true";
-F "myfile=@/home/ec2-user/msg9.msg"

This setup is working fine.

But new requirement is extract messages using outlook pst file.
I tried following command to extract messages from outlook pst file.

curl  "
http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true";
-F "myfile=@/home/ec2-user/sateamc_0006.pst"

This command extracts only high-level tags and all of the messages end up in 
one document. I am not getting all the tags I get when extracting individual 
messages. Is the above command correct? Is the problem that it does not use 
recursion? How do I add recursion to the above command? Is it a Tika library 
problem?

Please help to solve above problem.

Advanced Thanks.

--sreenivasa kallu


Re: Select distinct records

2016-02-11 Thread Joel Bernstein
Yeah that would be the reason. If you want distributed unique capabilities,
then you might want to start testing out 6.0. Aside from SELECT DISTINCT
queries, you also have a much more mature Streaming Expression library
which supports the unique operation.

Joel Bernstein
http://joelsolr.blogspot.com/
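
For anyone trying the 6.0 route, a streaming expression using the unique
operation looks roughly like this (a sketch; collection and field names are
placeholders, and the inner search must be sorted on the field passed to over):

    curl "http://localhost:8983/solr/mycollection/stream" \
         --data-urlencode 'expr=unique(search(mycollection, q="*:*", fl="id,name_s", sort="name_s asc"), over="name_s")'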

On Thu, Feb 11, 2016 at 12:28 PM, Brian Narsi  wrote:

> Ok I see that Collapsing features requires documents to be co-located in
> the same shard in SolrCloud.
>
> Could that be a reason for duplication?
>
> On Thu, Feb 11, 2016 at 11:09 AM, Joel Bernstein 
> wrote:
>
> > The CollapsingQParserPlugin shouldn't have duplicates in the result set.
> > Can you provide the details?
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Feb 11, 2016 at 12:02 PM, Brian Narsi 
> wrote:
> >
> > > I have tried to use the Collapsing feature but it appears that it
> leaves
> > > duplicated records in the result set.
> > >
> > > Is that expected? Or any suggestions on working around it?
> > >
> > > Thanks
> > >
> > > On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi 
> wrote:
> > >
> > > > I am using
> > > >
> > > > Solr 5.1.0
> > > >
> > > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal  >
> > > > wrote:
> > > >
> > > >> What version of Solr are you using?
> > > >> Have you taken a look at the Collapsing Query Parser. It basically
> > > >> performs
> > > >> the same functions as grouping but is much more efficient at doing
> it.
> > > >> Take a look here:
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> > > >>
> > > >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi 
> > wrote:
> > > >>
> > > >> > I am trying to select distinct records from a collection. (I need
> > > >> distinct
> > > >> > name and corresponding id)
> > > >> >
> > > >> > I have tried using grouping and group format of simple but that
> > takes
> > > a
> > > >> > long time to execute and sometimes runs into out of memory
> > exception.
> > > >> > Another limitation seems to be that total number of groups are not
> > > >> > returned.
> > > >> >
> > > >> > Is there another faster and more efficient way to do this?
> > > >> >
> > > >> > Thank you
> > > >> >
> > > >> --
> > > >> Regards,
> > > >> Binoy Dalal
> > > >>
> > > >
> > > >
> > >
> >
>


Re: Custom auth plugin not loaded in SolrCloud

2016-02-11 Thread Noble Paul
Yes, a runtime lib cannot be used for loading container-level plugins
yet. Eventually they must be supported. You can open a ticket.

On Mon, Jan 4, 2016 at 1:07 AM, tine-2  wrote:
> Hi,
>
> are there any news on this? Was anyone able to get it to work?
>
> Cheers,
>
> tine
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Custom-auth-plugin-not-loaded-in-SolrCloud-tp4245670p4248340.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
-
Noble Paul


outlook email file pst extraction problem

2016-02-11 Thread Sreenivasa Kallu
Hi ,
   I am currently indexing individual outlook messages and searching is
working fine.
I have created solr core using following command.
 ./solr create -c sreenimsg1 -d data_driven_schema_configs

I am using following command to index individual messages.
curl  "
http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true";
-F "myfile=@/home/ec2-user/msg9.msg"

This setup is working fine.

But the new requirement is to extract messages from an outlook pst file.
I tried following command to extract messages from outlook pst file.

curl  "
http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true";
-F "myfile=@/home/ec2-user/sateamc_0006.pst"

This command is extracting only high level tags and is extracting all messages
into one message. I am not getting all the tags I got when extracting individual
messages. Is the above command correct? Is the problem that it is not using recursion?
How do I add recursion to the above command? Is it a tika library problem?

Please help to solve above problem.

Advanced Thanks.

--sreenivasa kallu


SolrCloud shard marked as down and "reloading" collection doesn't restore it

2016-02-11 Thread KNitin
Hi,

 I noticed while running an indexing job (2M docs but per doc size could be
2-3 MB) that one of the shards goes down just after the commit.  (Not
related to OOM or high cpu/load).  This marks the shard as "down" in zk and
even a reload of the collection does not recover the state.

There are no exceptions in the logs and the stack trace indicates jetty
threads in blocked state.

The last few lines in the logs are as follows:

trib=TOLEADER&wt=javabin&version=2} {add=[1552605 (1525453861590925312)]} 0
5
INFO  - 2016-02-06 19:17:47.658;
org.apache.solr.update.DirectUpdateHandler2; start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2016-02-06 19:18:02.209; org.apache.solr.core.SolrDeletionPolicy;
SolrDeletionPolicy.onCommit: commits: num=2
INFO  - 2016-02-06 19:18:02.209; org.apache.solr.core.SolrDeletionPolicy;
newest commit generation = 6
INFO  - 2016-02-06 19:18:02.233; org.apache.solr.search.SolrIndexSearcher;
Opening Searcher@321a0cc9 main
INFO  - 2016-02-06 19:18:02.296; org.apache.solr.core.QuerySenderListener;
QuerySenderListener sending requests to Searcher@321a0cc9
main{StandardDirectoryReader(segments_6:180:nrt
_20(4.6):C15155/216:delGen=1 _w(4.6):C1538/63:delGen=2
_16(4.6):C279/20:delGen=2 _e(4.6):C11386/514:delGen=3
_g(4.6):C4434/204:delGen=3 _p(4.6):C418/5:delGen=1 _v(4.6):C1
_x(4.6):C17583/316:delGen=2 _y(4.6):C9783/112:delGen=2
_z(4.6):C4736/47:delGen=2 _12(4.6):C705/2:delGen=1 _13(4.6):C275/4:delGen=1
_1b(4.6):C619 _26(4.6):C318/13:delGen=1 _1e(4.6):C25356/763:delGen=3
_1f(4.6):C13024/426:delGen=2 _1g(4.6):C5368/142:delGen=2
_1j(4.6):C499/16:delGen=2 _1m(4.6):C448/23:delGen=2
_1p(4.6):C236/17:delGen=2 _1k(4.6):C173/5:delGen=1
_1s(4.6):C1082/78:delGen=2 _1t(4.6):C195/17:delGen=2 _1u(4.6):C2
_21(4.6):C16494/1278:delGen=1 _22(4.6):C5193/398:delGen=1
_23(4.6):C1361/102:delGen=1 _24(4.6):C475/36:delGen=1
_29(4.6):C126/11:delGen=1 _2d(4.6):C97/3:delGen=1 _27(4.6):C59/7:delGen=1
_28(4.6):C26/6:delGen=1 _2b(4.6):C40 _25(4.6):C39/1:delGen=1
_2c(4.6):C139/9:delGen=1 _2a(4.6):C26/6:delGen=1)}


The only solution is to restart the cluster. Why does a reload not work and
is this a known bug (for which there is a patch I can apply)?

Any pointers are much appreciated

Thanks!
Nitin


ApacheCon NA 2016 - Important Dates!!!

2016-02-11 Thread Melissa Warnkin
 Hello everyone!
I hope this email finds you well.  I hope everyone is as excited about 
ApacheCon as I am!
I'd like to remind you all of a couple of important dates, as well as ask for 
your assistance in spreading the word! Please use your social media platform(s) 
to get the word out! The more visibility, the better ApacheCon will be for 
all!! :)
CFP Close: February 12, 2016
CFP Notifications: February 29, 2016
Schedule Announced: March 3, 2016
To submit a talk, please visit:  
http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp

Link to the main site can be found here:  
http://events.linuxfoundation.org/events/apache-big-data-north-america

Apache: Big Data North America 2016 Registration Fees:
Attendee Registration Fee: US$599 through March 6, US$799 through April 10, US$999 thereafter
Committer Registration Fee: US$275 through April 10, US$375 thereafter
Student Registration Fee: US$275 through April 10, $375 thereafter
Planning to attend ApacheCon North America 2016 May 11 - 13, 2016? There is an 
add-on option on the registration form to join the conference for a discounted 
fee of US$399, available only to Apache: Big Data North America attendees.
So, please tweet away!!
I look forward to seeing you in Vancouver! Have a groovy day!!
~Melissa, on behalf of the ApacheCon Team






Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2

2016-02-11 Thread Walter Underwood
I first wrote the “fall back to one at a time” code for Solr 1.3.

It is pretty easy if you plan for it. Make the batch size variable. When a 
batch fails, retry with a batch size of 1 for that particular batch. Then keep
going or fail; either way, you have good logging of which one failed.
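
A rough SolrJ sketch of that pattern (class and field names here are just
placeholders):

  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class FallbackIndexer {
      // Try the whole batch first; if it fails, retry the same docs one at a
      // time so the offending document(s) can be identified and reported.
      public static void addWithFallback(SolrClient client,
                                         List<SolrInputDocument> batch) {
          try {
              client.add(batch);
          } catch (Exception batchFailure) {
              for (SolrInputDocument doc : batch) {
                  try {
                      client.add(doc);
                  } catch (Exception docFailure) {
                      // this is a document that broke the batch: log it and move on
                      System.err.println("Failed doc id=" + doc.getFieldValue("id")
                              + ": " + docFailure.getMessage());
                  }
              }
          }
      }
  }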

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 11, 2016, at 10:06 AM, Erick Erickson  wrote:
> 
> Steven's solution is a very common one, complete to the
> notion of re-chunking. Depending on the throughput requirements,
> simply resending the offending packet one at a time is often
> sufficient (but not _efficient). I can imagine fallback scenarios
> like "try chunking 100 at a time, for those chunks that fail
> do 10 at a time and for those do 1 at a time".
> 
> That said, in a lot of situations, the number of failures is low
> enough that just falling back to one at a time while not elegant
> is sufficient
> 
> It sure will be nice to have SOLR-445 done, if we can just keep
> Hoss from going crazy before he gets done.
> 
> Best,
> Erick
> 
> On Thu, Feb 11, 2016 at 7:39 AM, Steven White  wrote:
>> For my application, the solution I implemented is I log the chunk that
>> failed into a file.  This file is than post processed one record at a
>> time.  The ones that fail, are reported to the admin and never looked at
>> again until the admin takes action.  This is not the most efficient
>> solution right now but I intend to refactor this code so that the failed
>> chunk is itself re-processed in smaller chunks till the chunk with the
>> failed record(s) is down to 1 record "chunk" that will fail.
>> 
>> Like Debraj, I would love to hear from others how they handle such failures.
>> 
>> Steve
>> 
>> 
>> On Thu, Feb 11, 2016 at 2:29 AM, Debraj Manna 
>> wrote:
>> 
>>> Thanks Erik. How do people handle this scenario? Right now the only option
>>> I can think of is to replay the entire batch by doing add for every single
>>> doc. Then this will give me error for all the docs which got added from the
>>> batch.
>>> 
>>> On Tue, Feb 9, 2016 at 10:57 PM, Erick Erickson 
>>> wrote:
>>> 
 This has been a long standing issue, Hoss is doing some current work on
>>> it
 see:
 https://issues.apache.org/jira/browse/SOLR-445
 
 But the short form is "no, not yet".
 
 Best,
 Erick
 
 On Tue, Feb 9, 2016 at 8:19 AM, Debraj Manna 
 wrote:
> Hi,
> 
> 
> 
> I have a Document Centric Versioning Constraints added in solr schema:-
> 
> 
>  false
>  doc_version
> 
> 
> I am adding multiple documents in solr in a single call using SolrJ
>>> 5.2.
> The code fragment looks something like below :-
> 
> 
> try {
>UpdateResponse resp = solrClient.add(docs.getDocCollection(),
>500);
>if (resp.getStatus() != 0) {
>throw new Exception(new StringBuilder(
>"Failed to add docs in solr ").append(resp.toString())
>.toString());
>}
>} catch (Exception e) {
>logError("Adding docs to solr failed", e);
>}
> 
> 
> If one of the document is violating the versioning constraints then
>>> Solr
 is
> returning an exception with error message like "user version is not
>>> high
> enough: 1454587156" & the other documents are getting added perfectly.
>>> Is
> there a way I can know which document is violating the constraints
>>> either
> in Solr logs or from the Update response returned by Solr?
> 
> Thanks
 
>>> 



Re: Tune Data Import Handler to retrieve maximum records

2016-02-11 Thread Erick Erickson
It's possible with JDBC settings (see the specific ones for your
driver), but dangerous. What if the number of rows is 1B or something?
You'll blow Solr's memory out of the water.
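
For reference, the usual knob is batchSize on the JdbcDataSource in
data-config.xml (a sketch; driver/url/query are placeholders, and for the
MySQL driver batchSize="-1" is the common way to make it stream rows instead
of buffering the whole result set):

  <dataConfig>
    <!-- batchSize maps to the JDBC fetch size -->
    <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/db"
                user="..." password="..."
                batchSize="-1"/>
    <document>
      <entity name="item" query="SELECT id, name FROM item"/>
    </document>
  </dataConfig>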

Best,
Erick

On Wed, Feb 10, 2016 at 12:45 PM, Troy Edwards  wrote:
> Is it possible for the Data Import Handler to bring in maximum number of
> records depending on available resources? If so, how should it be
> configured?
>
> Thanks,


Re: Need to move on SOlr cloud (help required)

2016-02-11 Thread Erick Erickson
bq: We want the hits on solr servers to be distributed

True, this happens automatically in SolrCloud, but a simple load
balancer in front of master/slave does the same thing.

bq: what if master node fail what should be our fail over strategy  ?

This is, indeed one of the advantages for SolrCloud, you don't have
to worry about this any more.

Another benefit (and you haven't touched on whether this matters)
is that in SolrCloud you do not have the latency of polling and
replicating from master to slave, in other words it supports Near Real Time.

This comes at some additional complexity however. If you have
your master node failing often enough to be a problem, you have
other issues ;)...

And the recovery strategy if the master fails is straightforward:
1> pick one of the slaves to be the master.
2> update the other nodes to point to the new master
3> re-index the docs from before the old master failed to the new master.

You can use system variables to not even have to manually edit all of the
solrconfig files, just supply different -D parameters on startup.
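
A sketch of what that can look like in solrconfig.xml (the property names are
just examples; the same file then serves either role depending on
-Denable.master=true / -Denable.slave=true and -Dmaster.url=... at startup):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="enable">${enable.master:false}</str>
      <str name="replicateAfter">commit</str>
    </lst>
    <lst name="slave">
      <str name="enable">${enable.slave:false}</str>
      <str name="masterUrl">${master.url:http://localhost:8983/solr/core1}</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>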

Best,
Erick

On Wed, Feb 10, 2016 at 10:39 PM, kshitij tyagi
 wrote:
> @Jack
>
> Currently we have around 55,00,000 docs
>
> Its not about load on one node we have load on different nodes at different
> times as our traffic is huge around 60k users at a given point of time
>
> We want the hits on solr servers to be distributed so we are planning to
> move on solr cloud as it would be fault tolerant.
>
>
>
> On Thu, Feb 11, 2016 at 11:10 AM, Midas A  wrote:
>
>> hi,
>> what if master node fail what should be our fail over strategy  ?
>>
>> On Wed, Feb 10, 2016 at 9:12 PM, Jack Krupansky 
>> wrote:
>>
>> > What exactly is your motivation? I mean, the primary benefit of SolrCloud
>> > is better support for sharding, and you have only a single shard. If you
>> > have no need for sharding and your master-slave replicated Solr has been
>> > working fine, then stick with it. If only one machine is having a load
>> > problem, then that one node should be replaced. There are indeed plenty
>> of
>> > good reasons to prefer SolrCloud over traditional master-slave
>> replication,
>> > but so far you haven't touched on any of them.
>> >
>> > How much data (number of documents) do you have?
>> >
>> > What is your typical query latency?
>> >
>> >
>> > -- Jack Krupansky
>> >
>> > On Wed, Feb 10, 2016 at 2:15 AM, kshitij tyagi <
>> > kshitij.shopcl...@gmail.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > We are currently using solr 5.2 and I need to move on solr cloud
>> > > architecture.
>> > >
>> > > As of now we are using 5 machines :
>> > >
>> > > 1. I am using 1 master where we are indexing ourdata.
>> > > 2. I replicate my data on other machines
>> > >
>> > > One or the other machine keeps on showing high load so I am planning to
>> > > move on solr cloud.
>> > >
>> > > Need help on following :
>> > >
>> > > 1. What should be my architecture in case of 5 machines to keep
>> > (zookeeper,
>> > > shards, core).
>> > >
>> > > 2. How to add a node.
>> > >
>> > > 3. what are the exact steps/process I need to follow in order to change
>> > to
>> > > solr cloud.
>> > >
>> > > 4. How indexing will work in solr cloud as of now I am using mysql
>> query
>> > to
>> > > get the data on master and then index the same (how I need to change
>> this
>> > > in case of solr cloud).
>> > >
>> > > Regards,
>> > > Kshitij
>> > >
>> >
>>


Re: Size of logs are high

2016-02-11 Thread Erick Erickson
You can also look at your log4j properties file and manipulate the
max log size, how many old versions are retained, etc.

If you're talking about the console log, people often just disable
console logging (again in the logging properties file).
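
For example, in the log4j.properties shipped with Solr 5.x
(server/resources/log4j.properties) the relevant settings look roughly like
this; treat it as a sketch, since appender names can differ between installs:

  # keep the rotated files smaller / keep fewer of them
  log4j.appender.file.MaxFileSize=4MB
  log4j.appender.file.MaxBackupIndex=9

  # raise the threshold and/or drop CONSOLE from the root logger
  log4j.rootLogger=WARN, file, CONSOLE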

Best,
Erick

On Thu, Feb 11, 2016 at 6:11 AM, Aditya Sundaram
 wrote:
> Can you check your log level? Probably log level of error would suffice for
> your purpose and it would most certainly reduce your log size(s).
>
> On Thu, Feb 11, 2016 at 12:53 PM, kshitij tyagi > wrote:
>
>> Hi,
>> I have migrated to solr 5.2 and the size of logs are high.
>>
>> Can anyone help me out here how to control this?
>>
>
>
>
> --
> Aditya Sundaram
> Software Engineer, Technology team
> AKR Tech park B Block, B1 047
> +91-9844006866


Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2

2016-02-11 Thread Erick Erickson
Steven's solution is a very common one, complete to the
notion of re-chunking. Depending on the throughput requirements,
simply resending the offending packet one at a time is often
sufficient (but not _efficient_). I can imagine fallback scenarios
like "try chunking 100 at a time, for those chunks that fail
do 10 at a time and for those do 1 at a time".

That said, in a lot of situations, the number of failures is low
enough that just falling back to one at a time, while not elegant,
is sufficient.

It sure will be nice to have SOLR-445 done, if we can just keep
Hoss from going crazy before he gets done.

Best,
Erick

On Thu, Feb 11, 2016 at 7:39 AM, Steven White  wrote:
> For my application, the solution I implemented is I log the chunk that
> failed into a file.  This file is than post processed one record at a
> time.  The ones that fail, are reported to the admin and never looked at
> again until the admin takes action.  This is not the most efficient
> solution right now but I intend to refactor this code so that the failed
> chunk is itself re-processed in smaller chunks till the chunk with the
> failed record(s) is down to 1 record "chunk" that will fail.
>
> Like Debraj, I would love to hear from others how they handle such failures.
>
> Steve
>
>
> On Thu, Feb 11, 2016 at 2:29 AM, Debraj Manna 
> wrote:
>
>> Thanks Erik. How do people handle this scenario? Right now the only option
>> I can think of is to replay the entire batch by doing add for every single
>> doc. Then this will give me error for all the docs which got added from the
>> batch.
>>
>> On Tue, Feb 9, 2016 at 10:57 PM, Erick Erickson 
>> wrote:
>>
>> > This has been a long standing issue, Hoss is doing some current work on
>> it
>> > see:
>> > https://issues.apache.org/jira/browse/SOLR-445
>> >
>> > But the short form is "no, not yet".
>> >
>> > Best,
>> > Erick
>> >
>> > On Tue, Feb 9, 2016 at 8:19 AM, Debraj Manna 
>> > wrote:
>> > > Hi,
>> > >
>> > >
>> > >
>> > > I have a Document Centric Versioning Constraints added in solr schema:-
>> > >
>> > > 
>> > >   false
>> > >   doc_version
>> > > 
>> > >
>> > > I am adding multiple documents in solr in a single call using SolrJ
>> 5.2.
>> > > The code fragment looks something like below :-
>> > >
>> > >
>> > > try {
>> > > UpdateResponse resp = solrClient.add(docs.getDocCollection(),
>> > > 500);
>> > > if (resp.getStatus() != 0) {
>> > > throw new Exception(new StringBuilder(
>> > > "Failed to add docs in solr ").append(resp.toString())
>> > > .toString());
>> > > }
>> > > } catch (Exception e) {
>> > > logError("Adding docs to solr failed", e);
>> > > }
>> > >
>> > >
>> > > If one of the document is violating the versioning constraints then
>> Solr
>> > is
>> > > returning an exception with error message like "user version is not
>> high
>> > > enough: 1454587156" & the other documents are getting added perfectly.
>> Is
>> > > there a way I can know which document is violating the constraints
>> either
>> > > in Solr logs or from the Update response returned by Solr?
>> > >
>> > > Thanks
>> >
>>


Re: slave is getting full synced every polling

2016-02-11 Thread Erick Erickson
What is your replication configuration in solrconfig.xml on both
master and slave?

bq:  big core is doing full sync every time wherever it start (every minute).

Do you mean that Solr is restarting every minute, or that the polling
interval is 60 seconds?

The Solr logs should tell you something about what's going on there.
Also, if you are for some reason optimizing the index, that'll cause a
full replication.

Best,
Erick

On Thu, Feb 11, 2016 at 8:41 AM, Novin Novin  wrote:
> Hi Guys,
>
> I'm having a problem with master slave syncing.
>
> So I have two cores one is small core (just keep data use frequently for
> fast results) and another is big core (for rare query and for search in
> every thing). both core has same solrconfig file. But small core
> replication is fine, other than this big core is doing full sync every time
> wherever it start (every minute).
>
> I found this
> http://stackoverflow.com/questions/6435652/solr-replication-keeps-downloading-entire-index-from-master
>
> But not really usefull.
>
> Solr verion 5.2.0
> Small core has doc 10 mil. size around 10 to 15 GB.
> Big core has doc greater than 100 mil. size around 25 to 35 GB.
>
> How can I stop full sync.
>
> Thanks
> Novin


dismax for bigrams and phrases

2016-02-11 Thread Le Zhao

Hey Solr folks,

Current dismax parser behavior is different for unigrams versus bigrams.

For unigrams, it's MAX-ed across fields (so called dismax), but for 
bigrams, it's SUM-ed from Solr 4.10 (according to 
https://issues.apache.org/jira/browse/SOLR-6062).


Given this inconsistency, the dilemma we are facing now is the following:
for a query with three terms: [A B C]
Relevant doc1: f1:[AB .. C] f2:[BC]   // here AB in field1 and BC in 
field2 are bigrams, and C is a unigram
Irrelevant doc2: f1:[AB .. C] f2:[AB] f3:[AB]  // here only bigram AB is 
present in the doc, but in three different fields.


(A B C here can be e.g. "light blue bag", and doc2 can talk about "light 
blue coat" a lot, while only mentioning a "bag" somewhere.)


Without bigram level MAX across fields, there is no way to rank doc1 
above doc2.
(doc1 is preferred because it hits two different bigrams, while doc2 
only hits one bigram in several different fields.)


Also, being a sum makes the retrieval score difficult to bound, making 
it hard to combine the retrieval score with other document level signals 
(e.g. document quality), or to trade off between unigrams and bigrams.


Are the problems clear?

Can someone offer a solution other than dismax for bigrams/phrases? i.e. 
https://issues.apache.org/jira/browse/SOLR-6600 ?  (SOLR-6600 seems to 
be misclassified as a duplicate of SOLR-6062, while they seem to be the 
exact opposite.)


Thanks,
Le

PS cc'ing Jan who pointed me to the group.


Re: Select distinct records

2016-02-11 Thread Brian Narsi
Ok I see that Collapsing features requires documents to be co-located in
the same shard in SolrCloud.

Could that be a reason for duplication?

On Thu, Feb 11, 2016 at 11:09 AM, Joel Bernstein  wrote:

> The CollapsingQParserPlugin shouldn't have duplicates in the result set.
> Can you provide the details?
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Feb 11, 2016 at 12:02 PM, Brian Narsi  wrote:
>
> > I have tried to use the Collapsing feature but it appears that it leaves
> > duplicated records in the result set.
> >
> > Is that expected? Or any suggestions on working around it?
> >
> > Thanks
> >
> > On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi  wrote:
> >
> > > I am using
> > >
> > > Solr 5.1.0
> > >
> > > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal 
> > > wrote:
> > >
> > >> What version of Solr are you using?
> > >> Have you taken a look at the Collapsing Query Parser. It basically
> > >> performs
> > >> the same functions as grouping but is much more efficient at doing it.
> > >> Take a look here:
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> > >>
> > >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi 
> wrote:
> > >>
> > >> > I am trying to select distinct records from a collection. (I need
> > >> distinct
> > >> > name and corresponding id)
> > >> >
> > >> > I have tried using grouping and group format of simple but that
> takes
> > a
> > >> > long time to execute and sometimes runs into out of memory
> exception.
> > >> > Another limitation seems to be that total number of groups are not
> > >> > returned.
> > >> >
> > >> > Is there another faster and more efficient way to do this?
> > >> >
> > >> > Thank you
> > >> >
> > >> --
> > >> Regards,
> > >> Binoy Dalal
> > >>
> > >
> > >
> >
>


Re: Select distinct records

2016-02-11 Thread Joel Bernstein
The CollapsingQParserPlugin shouldn't have duplicates in the result set.
Can you provide the details?

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Feb 11, 2016 at 12:02 PM, Brian Narsi  wrote:

> I have tried to use the Collapsing feature but it appears that it leaves
> duplicated records in the result set.
>
> Is that expected? Or any suggestions on working around it?
>
> Thanks
>
> On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi  wrote:
>
> > I am using
> >
> > Solr 5.1.0
> >
> > On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal 
> > wrote:
> >
> >> What version of Solr are you using?
> >> Have you taken a look at the Collapsing Query Parser. It basically
> >> performs
> >> the same functions as grouping but is much more efficient at doing it.
> >> Take a look here:
> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> >>
> >> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi  wrote:
> >>
> >> > I am trying to select distinct records from a collection. (I need
> >> distinct
> >> > name and corresponding id)
> >> >
> >> > I have tried using grouping and group format of simple but that takes
> a
> >> > long time to execute and sometimes runs into out of memory exception.
> >> > Another limitation seems to be that total number of groups are not
> >> > returned.
> >> >
> >> > Is there another faster and more efficient way to do this?
> >> >
> >> > Thank you
> >> >
> >> --
> >> Regards,
> >> Binoy Dalal
> >>
> >
> >
>


Custom plugin to handle proprietary binary input stream

2016-02-11 Thread michael dürr
I'm looking for an option to write a Solr plugin which can deal with a
custom binary input stream. Unfortunately Solr's javabin as a protocol is
not an option for us.

I already had a look at some possibilities like writing a custom request
handler, but it seems like the classes/interfaces one would need to
implement are not "generic" enough (e.g. the
SolrRequestHandler#handleRequest() method expects objects of the classes
SolrQueryRequest and SolrQueryResponse)

It would be of great help if you could direct me to any "pluggable"
solution which allows us to receive and parse a proprietary binary stream at a
Solr server so that we do not have to provide our own customized binary solr
server.

Background:
Our problem is that we use a proprietary protocol to transfer our solr
queries together with some other Java objects to our solr server (at
present 3.6). The reason for this is that we have some logic at the solr
server which heavily depends on these other Java objects. Unfortunately we
cannot easily shift that logic to the client side.

Thank you!

Michael


Enforce client auth in Solr

2016-02-11 Thread GAUTHAM S
Hello,

I am trying to implement a Solr cluster with mutual authentication using
client and server SSL certificates. I have both client and server
certificates signed by a CA. The setup is working well; however, any client
cert that chains up to the issuer CA is able to access the Solr cluster
without validating the actual client cert that is added to the trust store
of the server.  Is there any way that we could enforce validation of the
client cert's UID and DC on the Solr server to ensure only allowed client
certs are able to access Solr?

Solr version used - 4.10.3 and 5.4.1
Container used  - jetty

Thanks in advance.


Regards,
Gautham


Re: Select distinct records

2016-02-11 Thread Brian Narsi
I have tried to use the Collapsing feature but it appears that it leaves
duplicated records in the result set.

Is that expected? Or any suggestions on working around it?

Thanks

On Thu, Feb 11, 2016 at 9:30 AM, Brian Narsi  wrote:

> I am using
>
> Solr 5.1.0
>
> On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal 
> wrote:
>
>> What version of Solr are you using?
>> Have you taken a look at the Collapsing Query Parser. It basically
>> performs
>> the same functions as grouping but is much more efficient at doing it.
>> Take a look here:
>>
>> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
>>
>> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi  wrote:
>>
>> > I am trying to select distinct records from a collection. (I need
>> distinct
>> > name and corresponding id)
>> >
>> > I have tried using grouping and group format of simple but that takes a
>> > long time to execute and sometimes runs into out of memory exception.
>> > Another limitation seems to be that total number of groups are not
>> > returned.
>> >
>> > Is there another faster and more efficient way to do this?
>> >
>> > Thank you
>> >
>> --
>> Regards,
>> Binoy Dalal
>>
>
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Alessandro Benedetti
Out of curiosity, have you tried to debug that solr version to see which
text arrives at the splitOnTokens method?
In the latest solr that part has changed completely.
I would be curious to understand what it tries to tokenise by ? and * !

Cheers

On 11 February 2016 at 16:33, Matteo Grolla  wrote:

> virtual hardware, 200ms is taken on the client until response is written to
> disk
> qtime on solr is ~90ms
> not great but acceptable
>
> Is it possible that the method FilenameUtils.splitOnTokens is really so
> heavy when requesting a lot of rows on slow hardware?
>
> 2016-02-11 17:17 GMT+01:00 Jack Krupansky :
>
> > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but
> still
> > relatively bad. Even 50ms for 10 rows would be considered barely okay.
> > But... again it depends on query complexity - simple queries should be
> well
> > under 50 ms for decent modern hardware.
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla  >
> > wrote:
> >
> > > Hi Jack,
> > >   response time scale with rows. Relationship doens't seem linear
> but
> > > Below 400 rows times are much faster,
> > > I view query times from solr logs and they are fast
> > > the same query with rows = 1000 takes 8s
> > > with rows = 10 takes 0.2s
> > >
> > >
> > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky :
> > >
> > > > Are queries scaling linearly - does a query for 100 rows take 1/10th
> > the
> > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> > > >
> > > > Does the app need/expect exactly 1,000 documents for the query or is
> > that
> > > > just what this particular query happened to return?
> > > >
> > > > What does they query look like? Is it complex or use wildcards or
> > > function
> > > > queries, or is it very simple keywords? How many operators?
> > > >
> > > > Have you used the debugQuery=true parameter to see which search
> > > components
> > > > are taking the time?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <
> > matteo.gro...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Yonic,
> > > > >  after the first query I find 1000 docs in the document cache.
> > > > > I'm using curl to send the request and requesting javabin format to
> > > mimic
> > > > > the application.
> > > > > gc activity is low
> > > > > I managed to load the entire 50GB index in the filesystem cache,
> > after
> > > > that
> > > > > queries don't cause disk activity anymore.
> > > > > Time improves now queries that took ~30s take <10s. But I hoped
> > better
> > > > > I'm going to use jvisualvm's sampler to analyze where time is spent
> > > > >
> > > > >
> > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
> > > > >
> > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> > > > matteo.gro...@gmail.com>
> > > > > > wrote:
> > > > > > > Thanks Toke, yes, they are long times, and solr qtime (to
> execute
> > > the
> > > > > > > query) is a fraction of a second.
> > > > > > > The response in javabin format is around 300k.
> > > > > >
> > > > > > OK, That tells us a lot.
> > > > > > And if you actually tested so that all the docs would be in the
> > cache
> > > > > > (can you verify this by looking at the cache stats after you
> > > > > > re-execute?) then it seems like the slowness is down to any of:
> > > > > > a) serializing the response (it doesn't seem like a 300K response
> > > > > > should take *that* long to serialize)
> > > > > > b) reading/processing the response (how fast the client can do
> > > > > > something with each doc is also a factor...)
> > > > > > c) other (GC, network, etc)
> > > > > >
> > > > > > You can try taking client processing out of the equation by
> trying
> > a
> > > > > > curl request.
> > > > > >
> > > > > > -Yonik
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


slave is getting full synced every polling

2016-02-11 Thread Novin Novin
Hi Guys,

I'm having a problem with master slave syncing.

So I have two cores: a small core (it just keeps the frequently used data, for
fast results) and a big core (for rare queries and for searching across
everything). Both cores have the same solrconfig file. The small core's
replication is fine, but the big core is doing a full sync every time
it starts (every minute).

I found this
http://stackoverflow.com/questions/6435652/solr-replication-keeps-downloading-entire-index-from-master

But it was not really useful.

Solr version 5.2.0
The small core has about 10 million docs and is around 10 to 15 GB.
The big core has more than 100 million docs and is around 25 to 35 GB.

How can I stop the full sync?

Thanks
Novin


RE: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Markus Jelsma
Awesome! The surrounding braces did the thing. Fixed the quotes just before.
Many thanks!!

The remaining issue is that some source files in o.a.s.search.facet package are 
package protected or private. I can't implement a custom Agg using FacetContext 
and such. Created issue: https://issues.apache.org/jira/browse/SOLR-8673

Thanks again!
Markus
 
 
-Original message-
> From:Yonik Seeley 
> Sent: Thursday 11th February 2016 17:12
> To: solr-user@lucene.apache.org
> Subject: Re: Json faceting, aggregate numeric field by day?
> 
> On Thu, Feb 11, 2016 at 11:07 AM, Markus Jelsma
>  wrote:
> > Hi - i was sending the following value for json.facet:
> > json.facet=by_day:{type : range, start : NOW-30DAY/DAY, end : NOW/DAY, gap 
> > : "+1DAY", facet:{x : "avg(rank)"}}
> >
> > I now also notice i didn't include the time field. But adding it gives the 
> > same error:
> > json.facet=by_day:{type : range, field : time, start : NOW-30DAY/DAY, end : 
> > NOW/DAY, gap : "+1DAY", facet:{x : "avg(rank)"}}
> 
> Hmmm, the whole thing is a JSON object, so it needs curly braces
> around the whole thing...
> json.facet={by_day: [...] }
> 
> You may need quotes around the date specs as well (containing slashes,
> etc)... not sure if they will be parsed as a single string or not
> 
> -Yonik
> 


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
virtual hardware, 200ms is taken on the client until response is written to
disk
qtime on solr is ~90ms
not great but acceptable

Is it possible that the method FilenameUtils.splitOnTokens is really so
heavy when requesting a lot of rows on slow hardware?

2016-02-11 17:17 GMT+01:00 Jack Krupansky :

> Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but still
> relatively bad. Even 50ms for 10 rows would be considered barely okay.
> But... again it depends on query complexity - simple queries should be well
> under 50 ms for decent modern hardware.
>
> -- Jack Krupansky
>
> On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla 
> wrote:
>
> > Hi Jack,
> >   response time scale with rows. Relationship doens't seem linear but
> > Below 400 rows times are much faster,
> > I view query times from solr logs and they are fast
> > the same query with rows = 1000 takes 8s
> > with rows = 10 takes 0.2s
> >
> >
> > 2016-02-11 16:22 GMT+01:00 Jack Krupansky :
> >
> > > Are queries scaling linearly - does a query for 100 rows take 1/10th
> the
> > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> > >
> > > Does the app need/expect exactly 1,000 documents for the query or is
> that
> > > just what this particular query happened to return?
> > >
> > > What does they query look like? Is it complex or use wildcards or
> > function
> > > queries, or is it very simple keywords? How many operators?
> > >
> > > Have you used the debugQuery=true parameter to see which search
> > components
> > > are taking the time?
> > >
> > > -- Jack Krupansky
> > >
> > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <
> matteo.gro...@gmail.com>
> > > wrote:
> > >
> > > > Hi Yonic,
> > > >  after the first query I find 1000 docs in the document cache.
> > > > I'm using curl to send the request and requesting javabin format to
> > mimic
> > > > the application.
> > > > gc activity is low
> > > > I managed to load the entire 50GB index in the filesystem cache,
> after
> > > that
> > > > queries don't cause disk activity anymore.
> > > > Time improves now queries that took ~30s take <10s. But I hoped
> better
> > > > I'm going to use jvisualvm's sampler to analyze where time is spent
> > > >
> > > >
> > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
> > > >
> > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> > > matteo.gro...@gmail.com>
> > > > > wrote:
> > > > > > Thanks Toke, yes, they are long times, and solr qtime (to execute
> > the
> > > > > > query) is a fraction of a second.
> > > > > > The response in javabin format is around 300k.
> > > > >
> > > > > OK, That tells us a lot.
> > > > > And if you actually tested so that all the docs would be in the
> cache
> > > > > (can you verify this by looking at the cache stats after you
> > > > > re-execute?) then it seems like the slowness is down to any of:
> > > > > a) serializing the response (it doesn't seem like a 300K response
> > > > > should take *that* long to serialize)
> > > > > b) reading/processing the response (how fast the client can do
> > > > > something with each doc is also a factor...)
> > > > > c) other (GC, network, etc)
> > > > >
> > > > > You can try taking client processing out of the equation by trying
> a
> > > > > curl request.
> > > > >
> > > > > -Yonik
> > > > >
> > > >
> > >
> >
>


Re: Select distinct records

2016-02-11 Thread Joel Bernstein
Solr 6.0 supports SELECT DISTINCT (SQL) queries. You can even choose
between a MapReduce implementation and a Json Facet implementation. The
MapReduce Implementation supports extremely high cardinality for the
distinct fields. Json Facet implementation supports lower cardinality but
high QPS.
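
For example, something like this against the 6.0 /sql handler (collection and
field names are placeholders):

  curl --data-urlencode 'stmt=SELECT DISTINCT name, id FROM collection1' \
    'http://localhost:8983/solr/collection1/sql?aggregationMode=map_reduce'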

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Feb 11, 2016 at 10:30 AM, Brian Narsi  wrote:

> I am using
>
> Solr 5.1.0
>
> On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal 
> wrote:
>
> > What version of Solr are you using?
> > Have you taken a look at the Collapsing Query Parser. It basically
> performs
> > the same functions as grouping but is much more efficient at doing it.
> > Take a look here:
> >
> >
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> >
> > On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi  wrote:
> >
> > > I am trying to select distinct records from a collection. (I need
> > distinct
> > > name and corresponding id)
> > >
> > > I have tried using grouping and group format of simple but that takes a
> > > long time to execute and sometimes runs into out of memory exception.
> > > Another limitation seems to be that total number of groups are not
> > > returned.
> > >
> > > Is there another faster and more efficient way to do this?
> > >
> > > Thank you
> > >
> > --
> > Regards,
> > Binoy Dalal
> >
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but still
relatively bad. Even 50ms for 10 rows would be considered barely okay.
But... again it depends on query complexity - simple queries should be well
under 50 ms for decent modern hardware.

-- Jack Krupansky

On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla 
wrote:

> Hi Jack,
>   response time scale with rows. Relationship doens't seem linear but
> Below 400 rows times are much faster,
> I view query times from solr logs and they are fast
> the same query with rows = 1000 takes 8s
> with rows = 10 takes 0.2s
>
>
> 2016-02-11 16:22 GMT+01:00 Jack Krupansky :
>
> > Are queries scaling linearly - does a query for 100 rows take 1/10th the
> > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> >
> > Does the app need/expect exactly 1,000 documents for the query or is that
> > just what this particular query happened to return?
> >
> > What does they query look like? Is it complex or use wildcards or
> function
> > queries, or is it very simple keywords? How many operators?
> >
> > Have you used the debugQuery=true parameter to see which search
> components
> > are taking the time?
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla 
> > wrote:
> >
> > > Hi Yonic,
> > >  after the first query I find 1000 docs in the document cache.
> > > I'm using curl to send the request and requesting javabin format to
> mimic
> > > the application.
> > > gc activity is low
> > > I managed to load the entire 50GB index in the filesystem cache, after
> > that
> > > queries don't cause disk activity anymore.
> > > Time improves now queries that took ~30s take <10s. But I hoped better
> > > I'm going to use jvisualvm's sampler to analyze where time is spent
> > >
> > >
> > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
> > >
> > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> > matteo.gro...@gmail.com>
> > > > wrote:
> > > > > Thanks Toke, yes, they are long times, and solr qtime (to execute
> the
> > > > > query) is a fraction of a second.
> > > > > The response in javabin format is around 300k.
> > > >
> > > > OK, That tells us a lot.
> > > > And if you actually tested so that all the docs would be in the cache
> > > > (can you verify this by looking at the cache stats after you
> > > > re-execute?) then it seems like the slowness is down to any of:
> > > > a) serializing the response (it doesn't seem like a 300K response
> > > > should take *that* long to serialize)
> > > > b) reading/processing the response (how fast the client can do
> > > > something with each doc is also a factor...)
> > > > c) other (GC, network, etc)
> > > >
> > > > You can try taking client processing out of the equation by trying a
> > > > curl request.
> > > >
> > > > -Yonik
> > > >
> > >
> >
>


Re: Logging request times

2016-02-11 Thread Shawn Heisey
On 2/10/2016 10:33 AM, McCallick, Paul wrote:
> We’re trying to fine tune our query and ingestion performance and would like 
> to get more metrics out of SOLR around this.  We are capturing the standard 
> logs as well as the jetty request logs.  The standard logs get us QTime, 
> which is not a good indication of how long the actual request took to 
> process.  The Jetty request logs only show requests between nodes.  I can’t 
> seem to find the client requests in there.
>
> I’d like to start tracking:
>
>   *   each request to index a document (or batch of documents) and the time 
> it took.
>   *   Each request to execute a query and the time it took.

The Jetty request log will usually include the IP address of the client
making the request.  If IP addresses are included in your log and you
aren't seeing anything from your client address(es), perhaps those
requests are being sent to another node.

Logging elapsed time is also something that the clients can do.  If the
client is using SolrJ, every response object has a "getElapsedTime"
method (and also "getQTime") that would allow the client program to log
the elapsed time without doing its own calculation.  Or the client
program could calculate the elapsed time using whatever facilities are
available in the relevant language.
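
A minimal SolrJ sketch of that (URL and query are placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class TimedQuery {
      public static void main(String[] args) throws Exception {
          HttpSolrClient client =
              new HttpSolrClient("http://localhost:8983/solr/collection1");
          QueryResponse rsp = client.query(new SolrQuery("*:*"));
          // QTime = time Solr spent building the response,
          // elapsed = round trip as seen by this client
          System.out.println("qtime=" + rsp.getQTime()
                  + "ms, elapsed=" + rsp.getElapsedTime() + "ms");
          client.close();
      }
  }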

Thanks,
Shawn



Re: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Yonik Seeley
On Thu, Feb 11, 2016 at 11:07 AM, Markus Jelsma
 wrote:
> Hi - i was sending the following value for json.facet:
> json.facet=by_day:{type : range, start : NOW-30DAY/DAY, end : NOW/DAY, gap : 
> "+1DAY", facet:{x : "avg(rank)"}}
>
> I now also notice i didn't include the time field. But adding it gives the 
> same error:
> json.facet=by_day:{type : range, field : time, start : NOW-30DAY/DAY, end : 
> NOW/DAY, gap : "+1DAY", facet:{x : "avg(rank)"}}

Hmmm, the whole thing is a JSON object, so it needs curly braces
around the whole thing...
json.facet={by_day: [...] }

You may need quotes around the date specs as well (containing slashes,
etc)... not sure if they will be parsed as a single string or not
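
Putting those two changes together, the parameter would look something like:

  json.facet={by_day : {type : range, field : time,
                        start : "NOW-30DAY/DAY", end : "NOW/DAY", gap : "+1DAY",
                        facet : {x : "avg(rank)"}}}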

-Yonik


RE: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Markus Jelsma
Hi - i was sending the following value for json.facet:
json.facet=by_day:{type : range, start : NOW-30DAY/DAY, end : NOW/DAY, gap : 
"+1DAY", facet:{x : "avg(rank)"}}

I now also notice i didn't include the time field. But adding it gives the same 
error:
json.facet=by_day:{type : range, field : time, start : NOW-30DAY/DAY, end : 
NOW/DAY, gap : "+1DAY", facet:{x : "avg(rank)"}}

I must be missing something completely :)

Thanks,
Markus

 
 
-Original message-
> From:Yonik Seeley 
> Sent: Thursday 11th February 2016 16:13
> To: solr-user@lucene.apache.org
> Subject: Re: Json faceting, aggregate numeric field by day?
> 
> On Thu, Feb 11, 2016 at 10:04 AM, Markus Jelsma
>  wrote:
> > Thanks. But this yields an error in FacetModule:
> >
> > java.lang.ClassCastException: java.lang.String cannot be cast to 
> > java.util.Map
> > at 
> > org.apache.solr.search.facet.FacetModule.prepare(FacetModule.java:100)
> > at 
> > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:247)
> > at 
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
> > ...
> 
> We don't have the best error reporting yet... can you show what you
> sent for json.facet?
> 
> > Is it supposed to work?
> 
> Yep, there are tests.  Here's an example of calculating percentiles
> per range facet bucket (at the bottom):
> http://yonik.com/percentiles-for-solr-faceting/
> 
> > I also found open issues SOLR-6348 and SOLR-6352 which made me doubt is wat 
> > supported at all.
> 
> Those issues aren't related to the new facet module... you can tell by
> the syntax.
> 
> -Yonik
> 


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
Hi Jack,
  response times scale with rows. The relationship doesn't seem linear, but
below 400 rows times are much faster.
I view query times from the solr logs and they are fast:
the same query with rows = 1000 takes 8s,
with rows = 10 it takes 0.2s.


2016-02-11 16:22 GMT+01:00 Jack Krupansky :

> Are queries scaling linearly - does a query for 100 rows take 1/10th the
> time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
>
> Does the app need/expect exactly 1,000 documents for the query or is that
> just what this particular query happened to return?
>
> What does they query look like? Is it complex or use wildcards or function
> queries, or is it very simple keywords? How many operators?
>
> Have you used the debugQuery=true parameter to see which search components
> are taking the time?
>
> -- Jack Krupansky
>
> On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla 
> wrote:
>
> > Hi Yonic,
> >  after the first query I find 1000 docs in the document cache.
> > I'm using curl to send the request and requesting javabin format to mimic
> > the application.
> > gc activity is low
> > I managed to load the entire 50GB index in the filesystem cache, after
> that
> > queries don't cause disk activity anymore.
> > Time improves now queries that took ~30s take <10s. But I hoped better
> > I'm going to use jvisualvm's sampler to analyze where time is spent
> >
> >
> > 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
> >
> > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> matteo.gro...@gmail.com>
> > > wrote:
> > > > Thanks Toke, yes, they are long times, and solr qtime (to execute the
> > > > query) is a fraction of a second.
> > > > The response in javabin format is around 300k.
> > >
> > > OK, That tells us a lot.
> > > And if you actually tested so that all the docs would be in the cache
> > > (can you verify this by looking at the cache stats after you
> > > re-execute?) then it seems like the slowness is down to any of:
> > > a) serializing the response (it doesn't seem like a 300K response
> > > should take *that* long to serialize)
> > > b) reading/processing the response (how fast the client can do
> > > something with each doc is also a factor...)
> > > c) other (GC, network, etc)
> > >
> > > You can try taking client processing out of the equation by trying a
> > > curl request.
> > >
> > > -Yonik
> > >
> >
>


Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2

2016-02-11 Thread Steven White
For my application, the solution I implemented is that I log the chunk that
failed into a file.  This file is then post-processed one record at a
time.  The ones that fail are reported to the admin and never looked at
again until the admin takes action.  This is not the most efficient
solution right now, but I intend to refactor this code so that the failed
chunk is itself re-processed in smaller chunks till the chunk with the
failed record(s) is down to a 1-record "chunk" that will fail.

Like Debraj, I would love to hear from others how they handle such failures.

Steve


On Thu, Feb 11, 2016 at 2:29 AM, Debraj Manna 
wrote:

> Thanks Erik. How do people handle this scenario? Right now the only option
> I can think of is to replay the entire batch by doing add for every single
> doc. Then this will give me error for all the docs which got added from the
> batch.
>
> On Tue, Feb 9, 2016 at 10:57 PM, Erick Erickson 
> wrote:
>
> > This has been a long standing issue, Hoss is doing some current work on
> it
> > see:
> > https://issues.apache.org/jira/browse/SOLR-445
> >
> > But the short form is "no, not yet".
> >
> > Best,
> > Erick
> >
> > On Tue, Feb 9, 2016 at 8:19 AM, Debraj Manna 
> > wrote:
> > > Hi,
> > >
> > >
> > >
> > > I have a Document Centric Versioning Constraints added in solr schema:-
> > >
> > > 
> > >   false
> > >   doc_version
> > > 
> > >
> > > I am adding multiple documents in solr in a single call using SolrJ
> 5.2.
> > > The code fragment looks something like below :-
> > >
> > >
> > > try {
> > > UpdateResponse resp = solrClient.add(docs.getDocCollection(),
> > > 500);
> > > if (resp.getStatus() != 0) {
> > > throw new Exception(new StringBuilder(
> > > "Failed to add docs in solr ").append(resp.toString())
> > > .toString());
> > > }
> > > } catch (Exception e) {
> > > logError("Adding docs to solr failed", e);
> > > }
> > >
> > >
> > > If one of the document is violating the versioning constraints then
> Solr
> > is
> > > returning an exception with error message like "user version is not
> high
> > > enough: 1454587156" & the other documents are getting added perfectly.
> Is
> > > there a way I can know which document is violating the constraints
> either
> > > in Solr logs or from the Update response returned by Solr?
> > >
> > > Thanks
> >
>


Re: Select distinct records

2016-02-11 Thread Brian Narsi
I am using

Solr 5.1.0

On Thu, Feb 11, 2016 at 9:19 AM, Binoy Dalal  wrote:

> What version of Solr are you using?
> Have you taken a look at the Collapsing Query Parser. It basically performs
> the same functions as grouping but is much more efficient at doing it.
> Take a look here:
>
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
>
> On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi  wrote:
>
> > I am trying to select distinct records from a collection. (I need
> distinct
> > name and corresponding id)
> >
> > I have tried using grouping and group format of simple but that takes a
> > long time to execute and sometimes runs into out of memory exception.
> > Another limitation seems to be that total number of groups are not
> > returned.
> >
> > Is there another faster and more efficient way to do this?
> >
> > Thank you
> >
> --
> Regards,
> Binoy Dalal
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Are queries scaling linearly - does a query for 100 rows take 1/10th the
time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?

Does the app need/expect exactly 1,000 documents for the query or is that
just what this particular query happened to return?

What does they query look like? Is it complex or use wildcards or function
queries, or is it very simple keywords? How many operators?

Have you used the debugQuery=true parameter to see which search components
are taking the time?

-- Jack Krupansky

On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla 
wrote:

> Hi Yonic,
>  after the first query I find 1000 docs in the document cache.
> I'm using curl to send the request and requesting javabin format to mimic
> the application.
> gc activity is low
> I managed to load the entire 50GB index in the filesystem cache, after that
> queries don't cause disk activity anymore.
> Time improves now queries that took ~30s take <10s. But I hoped better
> I'm going to use jvisualvm's sampler to analyze where time is spent
>
>
> 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
>
> > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla 
> > wrote:
> > > Thanks Toke, yes, they are long times, and solr qtime (to execute the
> > > query) is a fraction of a second.
> > > The response in javabin format is around 300k.
> >
> > OK, That tells us a lot.
> > And if you actually tested so that all the docs would be in the cache
> > (can you verify this by looking at the cache stats after you
> > re-execute?) then it seems like the slowness is down to any of:
> > a) serializing the response (it doesn't seem like a 300K response
> > should take *that* long to serialize)
> > b) reading/processing the response (how fast the client can do
> > something with each doc is also a factor...)
> > c) other (GC, network, etc)
> >
> > You can try taking client processing out of the equation by trying a
> > curl request.
> >
> > -Yonik
> >
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
Responses have always been slow, but previously the time was dominated by
faceting.
After a few optimizations this is my bottleneck.
My suggestion has been to properly implement paging and reduce rows;
unfortunately this is not possible, at least not soon.

2016-02-11 16:18 GMT+01:00 Jack Krupansky :

> Is this a scenario that was working fine and suddenly deteriorated, or has
> it always been slow?
>
> -- Jack Krupansky
>
> On Thu, Feb 11, 2016 at 4:33 AM, Matteo Grolla 
> wrote:
>
> > Hi,
> >  I'm trying to optimize a solr application.
> > The bottleneck are queries that request 1000 rows to solr.
> > Unfortunately the application can't be modified at the moment, can you
> > suggest me what could be done on the solr side to increase the
> performance?
> > The bottleneck is just on fetching the results, the query executes very
> > fast.
> > I suggested caching .fdx and .fdt files on the file system cache.
> > Anything else?
> >
> > Thanks
> >
>


Re: Select distinct records

2016-02-11 Thread Binoy Dalal
What version of Solr are you using?
Have you taken a look at the Collapsing Query Parser. It basically performs
the same functions as grouping but is much more efficient at doing it.
Take a look here:
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
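
For the distinct-name-plus-id case that would be something like the following
(a sketch; it assumes "name" is a single-valued string/docValues field, and in
SolrCloud all docs with the same name must live in the same shard):

  q=*:*&fl=name,id&fq={!collapse field=name}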

On Thu, Feb 11, 2016 at 8:44 PM Brian Narsi  wrote:

> I am trying to select distinct records from a collection. (I need distinct
> name and corresponding id)
>
> I have tried using grouping and group format of simple but that takes a
> long time to execute and sometimes runs into out of memory exception.
> Another limitation seems to be that total number of groups are not
> returned.
>
> Is there another faster and more efficient way to do this?
>
> Thank you
>
-- 
Regards,
Binoy Dalal


RE: Running Solr on port 80

2016-02-11 Thread Davis, Daniel (NIH/NLM) [C]
You should edit the files installed by install_solr_service.sh - change the 
init.d script to pass the -p argument to ${SOLRINSTALLDIR}/bin/solr.

By the way, my initscript is modified (a) to support the conventional 
/etc/sysconfig/ convention, and (b) to run solr as a different 
user than the user who owns the jars. In the end, I don't use 
install_solr_service.sh at all, because it expects to have root and what I run 
to provision doesn't have root.
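
(For what it's worth, on a 5.x service install the port can usually also be
changed through the include file the installer creates; the path below is an
assumption and may differ per version:)

# /etc/default/solr.in.sh (sometimes /var/solr/solr.in.sh)
SOLR_PORT=8984
# then restart the service
sudo service solr restart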

-Original Message-
From: Jeyaprakash Singarayar [mailto:jpsingara...@gmail.com] 
Sent: Thursday, February 11, 2016 2:40 AM
To: solr-user@lucene.apache.org; binoydala...@gmail.com
Subject: Re: Running Solr on port 80

That's OK if I'm using it locally, but I'm doing this in production, based on the 
below page

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production



On Thu, Feb 11, 2016 at 12:58 PM, Binoy Dalal 
wrote:

> Why don't you directly run solr from the script provided in 
> {SOLR_DIST}\bin ./solr start -p 8984
>
> On Thu, 11 Feb 2016, 12:56 Jeyaprakash Singarayar 
> 
> wrote:
>
> > Hi,
> >
> > I'm trying to install solr 5.4.1 on CentOS. I know that while 
> > installing Solr as a service in the Linux we can pass -p <port number> to shift the app to host on that port.
> >
> >  ./install_solr_service.sh solr-5.4.1.tgz -p 8984 -f
> >
> > but still it shows as it is hosted on 8983 and not on 8984. Any idea?
> >
> > Waiting up to 30 seconds to see Solr running on port 8983 [/] 
> > Started Solr server on port 8983 (pid=33034). Happy searching!
> >
> > Found 1 Solr nodes:
> >
> > Solr process 33034 running on port 8983 {
> >   "solr_home":"/var/solr/data",
> >   "version":"5.4.1 1725212 - jpountz - 2016-01-18 11:51:45",
> >   "startTime":"2016-02-11T07:25:03.996Z",
> >   "uptime":"0 days, 0 hours, 0 minutes, 11 seconds",
> >   "memory":"68 MB (%13.9) of 490.7 MB"}
> >
> > Service solr installed.
> >
> --
> Regards,
> Binoy Dalal
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Is this a scenario that was working fine and suddenly deteriorated, or has
it always been slow?

-- Jack Krupansky

On Thu, Feb 11, 2016 at 4:33 AM, Matteo Grolla 
wrote:

> Hi,
>  I'm trying to optimize a solr application.
> The bottleneck are queries that request 1000 rows to solr.
> Unfortunately the application can't be modified at the moment, can you
> suggest me what could be done on the solr side to increase the performance?
> The bottleneck is just on fetching the results, the query executes very
> fast.
> I suggested caching .fdx and .fdt files on the file system cache.
> Anything else?
>
> Thanks
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
[image: inline image 1]

2016-02-11 16:05 GMT+01:00 Matteo Grolla :

> I see a lot of time spent in splitOnTokens
>
> which is called by (last part of stack trace)
>
> BinaryResponseWriter$Resolver.writeResultsBody()
> ...
> solr.search.ReturnsField.wantsField()
> commons.io.FileNameUtils.wildcardmatch()
> commons.io.FileNameUtils.splitOnTokens()
>
>
>
> 2016-02-11 15:42 GMT+01:00 Matteo Grolla :
>
>> Hi Yonic,
>>  after the first query I find 1000 docs in the document cache.
>> I'm using curl to send the request and requesting javabin format to mimic
>> the application.
>> gc activity is low
>> I managed to load the entire 50GB index in the filesystem cache, after
>> that queries don't cause disk activity anymore.
>> Time improves now queries that took ~30s take <10s. But I hoped better
>> I'm going to use jvisualvm's sampler to analyze where time is spent
>>
>>
>> 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
>>
>>> On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla 
>>> wrote:
>>> > Thanks Toke, yes, they are long times, and solr qtime (to execute the
>>> > query) is a fraction of a second.
>>> > The response in javabin format is around 300k.
>>>
>>> OK, That tells us a lot.
>>> And if you actually tested so that all the docs would be in the cache
>>> (can you verify this by looking at the cache stats after you
>>> re-execute?) then it seems like the slowness is down to any of:
>>> a) serializing the response (it doesn't seem like a 300K response
>>> should take *that* long to serialize)
>>> b) reading/processing the response (how fast the client can do
>>> something with each doc is also a factor...)
>>> c) other (GC, network, etc)
>>>
>>> You can try taking client processing out of the equation by trying a
>>> curl request.
>>>
>>> -Yonik
>>>
>>
>>
>


Select distinct records

2016-02-11 Thread Brian Narsi
I am trying to select distinct records from a collection. (I need distinct
name and corresponding id)

I have tried using grouping with group.format=simple, but that takes a
long time to execute and sometimes runs into an out-of-memory exception.
Another limitation seems to be that the total number of groups is not returned.

Is there another faster and more efficient way to do this?

Thank you


Re: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Yonik Seeley
On Thu, Feb 11, 2016 at 10:04 AM, Markus Jelsma
 wrote:
> Thanks. But this yields an error in FacetModule:
>
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
> at 
> org.apache.solr.search.facet.FacetModule.prepare(FacetModule.java:100)
> at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:247)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
> ...

We don't have the best error reporting yet... can you show what you
sent for json.facet?

> Is it supposed to work?

Yep, there are tests.  Here's an example of calculating percentiles
per range facet bucket (at the bottom):
http://yonik.com/percentiles-for-solr-faceting/

> I also found open issues SOLR-6348 and SOLR-6352 which made me doubt is wat 
> supported at all.

Those issues aren't related to the new facet module... you can tell by
the syntax.

-Yonik


RE: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Markus Jelsma
Thanks. But this yields an error in FacetModule:

java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
at 
org.apache.solr.search.facet.FacetModule.prepare(FacetModule.java:100)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:247)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
...

Is it supposed to work? I also found open issues SOLR-6348 and SOLR-6352, which 
made me doubt it was supported at all.

Thanks,
Markus

[1]: https://issues.apache.org/jira/browse/SOLR-6348
[2]: https://issues.apache.org/jira/browse/SOLR-6352
 
 
-Original message-
> From:Yonik Seeley 
> Sent: Thursday 11th February 2016 15:11
> To: solr-user@lucene.apache.org
> Subject: Re: Json faceting, aggregate numeric field by day?
> 
> On Wed, Feb 10, 2016 at 5:21 AM, Markus Jelsma
>  wrote:
> > Hi - if we assume the following simple documents:
> >
> > <doc>
> >   <field name="date">2015-01-01T00:00:00Z</field>
> >   <field name="value">2</field>
> > </doc>
> > <doc>
> >   <field name="date">2015-01-01T00:00:00Z</field>
> >   <field name="value">4</field>
> > </doc>
> > <doc>
> >   <field name="date">2015-01-02T00:00:00Z</field>
> >   <field name="value">3</field>
> > </doc>
> > <doc>
> >   <field name="date">2015-01-02T00:00:00Z</field>
> >   <field name="value">7</field>
> > </doc>
> >
> > Can i get a daily average for the field 'value' by day? e.g.
> >
> > <lst name="by_day">
> >   <double name="2015-01-01">3.0</double>
> >   <double name="2015-01-02">5.0</double>
> > </lst>
> 
> For the JSON Facet API, I guess this would be:
> 
> json.facet=
> 
> by_day : {
>   type : range,
>   start : ...,
>   end : ...,
>   gap : "+1DAY",
>   facet : {
> x : "avg(value)"
>   }
> }
> 
> 
> -Yonik
> 


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
I see a lot of time spent in splitOnTokens

which is called by (last part of stack trace)

BinaryResponseWriter$Resolver.writeResultsBody()
...
org.apache.solr.search.ReturnFields.wantsField()
org.apache.commons.io.FilenameUtils.wildcardMatch()
org.apache.commons.io.FilenameUtils.splitOnTokens()



2016-02-11 15:42 GMT+01:00 Matteo Grolla :

> Hi Yonic,
>  after the first query I find 1000 docs in the document cache.
> I'm using curl to send the request and requesting javabin format to mimic
> the application.
> gc activity is low
> I managed to load the entire 50GB index in the filesystem cache, after
> that queries don't cause disk activity anymore.
> Time improves now queries that took ~30s take <10s. But I hoped better
> I'm going to use jvisualvm's sampler to analyze where time is spent
>
>
> 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
>
>> On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla 
>> wrote:
>> > Thanks Toke, yes, they are long times, and solr qtime (to execute the
>> > query) is a fraction of a second.
>> > The response in javabin format is around 300k.
>>
>> OK, That tells us a lot.
>> And if you actually tested so that all the docs would be in the cache
>> (can you verify this by looking at the cache stats after you
>> re-execute?) then it seems like the slowness is down to any of:
>> a) serializing the response (it doesn't seem like a 300K response
>> should take *that* long to serialize)
>> b) reading/processing the response (how fast the client can do
>> something with each doc is also a factor...)
>> c) other (GC, network, etc)
>>
>> You can try taking client processing out of the equation by trying a
>> curl request.
>>
>> -Yonik
>>
>
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Yonik Seeley
On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla  wrote:
> Hi Yonic,
>  after the first query I find 1000 docs in the document cache.
> I'm using curl to send the request and requesting javabin format to mimic
> the application.
> gc activity is low
> I managed to load the entire 50GB index in the filesystem cache, after that
> queries don't cause disk activity anymore.
> Time improves now queries that took ~30s take <10s. But I hoped better
> I'm going to use jvisualvm's sampler to analyze where time is spent

Thanks, please keep us posted... something is definitely strange.

-Yonik


Re: Solr architecture

2016-02-11 Thread Upayavira
Your biggest issue here is likely to be http connections. Making an HTTP
connection to Solr is far more expensive than the act of adding a single
document to the index. If you are expecting to add 24 billion docs per
day, I'd suggest that somehow merging those documents into batches
before sending them to Solr will be necessary.
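
(A minimal sketch of batching, assuming the JSON update format; collection
name, field names and batch size are placeholders:)

# one HTTP request carrying many documents instead of one request per document;
# batches of a few hundred to a few thousand docs are a common starting point
curl 'http://localhost:8983/solr/collection1/update?commitWithin=60000' \
 -H 'Content-type:application/json' \
 -d '[{"id":"1","sessionId_s":"abc","event_t":"click"},
      {"id":"2","sessionId_s":"abc","event_t":"scroll"}]'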

To my previous question - what do you gain by using Solr that you don't
get from other solutions? I'd suggest that to make this system really
work, you are going to need a deep understanding of how Lucene works -
segments, segment merges, deletions, and many other things because when
you start to work at that scale, the implementation details behind
Lucene really start to matter and impact upon your ability to succeed.

I'd suggest that what you are undertaking can certainly be done, but is
a substantial project.

Upayavira

On Wed, Feb 10, 2016, at 09:48 PM, Mark Robinson wrote:
> Thanks everyone for your suggestions.
> Based on it I am planning to have one doc per event with sessionId
> common.
> 
> So in this case hopefully indexing each doc as and when it comes would be
> okay? Or do we still need to batch and index to Solr?
> 
> Also with 4M sessions a day with about 6000 docs (events) per session we
> can expect about 24Billion docs per day!
> 
> Will Solr still hold good. If so could some one please recommend a sizing
> to cater to this levels of data.
> The queries per second is around 320 qps.
> 
> Thanks!
> Mark
> 
> 
> On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
> 
> > Hi Mark,
> > Appending session actions just to be able to return more than one session
> > without retrieving large number of results is not good tradeoff. Like
> > Upayavira suggested, you should consider storing one action per doc and
> > aggregate on read time or push to Solr once session ends and aggregate on
> > some other layer.
> > If you are thinking handling infrastructure might be too much, you may
> > consider using some of logging services to hold data. One such service is
> > Sematext's Logsene (http://sematext.com/logsene).
> >
> > Thanks,
> > Emir
> >
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> > On 10.02.2016 03:22, Mark Robinson wrote:
> >
> >> Thanks for your replies and suggestions!
> >>
> >> Why I store all events related to a session under one doc?
> >> Each session can have about 500 total entries (events) corresponding to
> >> it.
> >> So when I try to retrieve a session's info it can back with around 500
> >> records. If it is this compounded one doc per session, I can retrieve more
> >> sessions at a time with one doc per session.
> >> eg under a sessionId an array of eventA activities, eventB activities
> >>   (using json). When an eventA activity again occurs, we will read all
> >> that
> >> data for that session, append this extra info to evenA data and push the
> >> whole session related data back (indexing) to Solr. Like this for many
> >> sessions parallely.
> >>
> >>
> >> Why NRT?
> >> Parallely many sessions are being written (4Million sessions hence
> >> 4Million
> >> docs per day). A person can do this querying any time.
> >>
> >> It is just a look up?
> >> Yes. We just need to retrieve all info for a session and pass it on to
> >> another system. We may even do some extra querying on some data like
> >> timestamps, pageurl etc in that info added to a session.
> >>
> >> Thinking of having the data separate from the actual Solr Instance and
> >> mention the loc of the dataDir in solrconfig.
> >>
> >> If Solr is not a good option could you please suggest something which will
> >> satisfy this use case with min response time while querying.
> >>
> >> Thanks!
> >> Mark
> >>
> >> On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins 
> >> wrote:
> >>
> >> So as I understand your use case, its effectively logging actions within a
> >>> user session, why do you have to do the update in NRT?  Why not just log
> >>> all the user session events (with some unique key, and ensuring the
> >>> session
> >>> Id is in the document somewhere), then when you want to do the query, you
> >>> join on the session id, and that gives you all the data records for that
> >>> session. I don't really follow why it has to be 1 document (which you
> >>> continually update). If you really need that aggregation, couldn't that
> >>> happen offline?
> >>>
> >>> I guess your 1 saving grace is that you query using the unique ID (in
> >>> your
> >>> scenario) so you could use the real-time get handler, since you aren't
> >>> doing a complex query (strictly its not a search, its a raw key lookup).
> >>>
> >>> But I would still question your use case, if you go the Solr route for
> >>> that
> >>> kind of scale with querying and indexing that much, you're going to have
> >>> to
> >>> throw a lot of hardware at it, as Jack says probably in the order of
> >>> hundreds

Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
Hi Yonik,
 after the first query I find 1000 docs in the document cache.
I'm using curl to send the request and requesting javabin format to mimic
the application.
gc activity is low.
I managed to load the entire 50GB index into the filesystem cache; after that,
queries don't cause disk activity anymore.
Times improve: queries that took ~30s now take <10s, but I hoped for better.
I'm going to use jvisualvm's sampler to analyze where the time is spent.


2016-02-11 15:25 GMT+01:00 Yonik Seeley :

> On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla 
> wrote:
> > Thanks Toke, yes, they are long times, and solr qtime (to execute the
> > query) is a fraction of a second.
> > The response in javabin format is around 300k.
>
> OK, That tells us a lot.
> And if you actually tested so that all the docs would be in the cache
> (can you verify this by looking at the cache stats after you
> re-execute?) then it seems like the slowness is down to any of:
> a) serializing the response (it doesn't seem like a 300K response
> should take *that* long to serialize)
> b) reading/processing the response (how fast the client can do
> something with each doc is also a factor...)
> c) other (GC, network, etc)
>
> You can try taking client processing out of the equation by trying a
> curl request.
>
> -Yonik
>


Re: multiple but identical suggestions in autocomplete

2016-02-11 Thread Alessandro Benedetti
Related to this, I just created:
https://issues.apache.org/jira/browse/SOLR-8672

To be fair, I see no utility in returning duplicate suggestions (if they
have no different payload they are indistinguishable from a human
perspective, hence the duplication is useless).
I would like to hear some counter examples.
In my opinion we should contribute a way to avoid the duplicates directly
in Solr.
If there are valid counter examples, we could add an additional boolean
parameter to the solr.SuggestComponent to toggle duplicate filtering.
In a lot of scenarios I guess it could be a good fit.
Cheers

On 5 August 2015 at 12:06, Nutch Solr User  wrote:

> You will need to call this service from UI as you are calling suggester
> component currently. (may be on every key-press event in text box). You
> will
> pass required parameters too.
>
> Service will internally form a solr suggester query and query Solr. From
> the
> returned response it will keep only unique suggestions from top N
> suggestions and return suggestions to UI.
>
>
>
> -
> Nutch Solr User
>
> "The ultimate search engine would basically understand everything in the
> world, and it would always give you the right thing."
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/multiple-but-identical-suggestions-in-autocomplete-tp4220055p4220953.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Yonik Seeley
On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla  wrote:
> Thanks Toke, yes, they are long times, and solr qtime (to execute the
> query) is a fraction of a second.
> The response in javabin format is around 300k.

OK, That tells us a lot.
And if you actually tested so that all the docs would be in the cache
(can you verify this by looking at the cache stats after you
re-execute?) then it seems like the slowness is down to any of:
a) serializing the response (it doesn't seem like a 300K response
should take *that* long to serialize)
b) reading/processing the response (how fast the client can do
something with each doc is also a factor...)
c) other (GC, network, etc)

You can try taking client processing out of the equation by trying a
curl request.

-Yonik
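
(A sketch of timing the raw fetch with curl, taking the client application out
of the picture; collection name and query are placeholders:)

# prints total transfer time and response size for a 1000-row javabin response
curl -s -o /dev/null -w 'total: %{time_total}s  size: %{size_download} bytes\n' \
 'http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&wt=javabin'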


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Alessandro Benedetti
Hi Matteo,
as an addition to Upayavira's observation, how is the memory assigned for
that Solr instance?
How much memory is assigned to Solr and how much is left for the OS?
Is this a VM on top of a physical machine? Is it real physical memory being
used, or could swapping happen frequently?
Is there enough memory to allow the OS to cache the stored-content index
segments in memory?
As a first thing I would try to exclude I/O bottlenecks with the disk (and
apparently your document cache experiment should exclude them).
Unfortunately the export request handler is not an option with 4.0.

Are you obtaining those timings under high query load or in low-load
timeframes?
What happens if you take the Solr instance all for yourself and repeat the
experiment?
In a healthy memory-mapped scenario I would not expect the document cache to
halve the time of a single query (of course I would expect a benefit, but it
looks like too big a difference for something that should already be in RAM).
In a dedicated instance, have you tried repeating the query execution without
the document cache (to possibly trigger memory mapping)?

But that should also be an alarming point: in a low-load Solr, with the
document fields in cache (so in Java heap memory), it is surprising that it
takes 14s to load the document fields.
I am curious to hear updates on this,

Cheers

On 11 February 2016 at 12:45, Matteo Grolla  wrote:

> Thanks Toke, yes, they are long times, and solr qtime (to execute the
> query) is a fraction of a second.
> The response in javabin format is around 300k.
> Currently I can't limit the rows requested or the fields requested, those
> are fixed for me.
>
> 2016-02-11 13:14 GMT+01:00 Toke Eskildsen :
>
> > On Thu, 2016-02-11 at 11:53 +0100, Matteo Grolla wrote:
> > >  I'm working with solr 4.0, sorting on score (default).
> > > I tried setting the document cache size to 2048, so all docs of a
> single
> > > request fit (2 requests fit actually)
> > > If I execute a query the first time it takes 24s
> > > I reexecute it, with all docs in the documentCache and it takes 15s
> > > execute it with rows = 400 and it takes 3s
> >
> > Those are very long execution times. It sounds like you either have very
> > complex queries or very large fields, as Binoy suggests. Can you provide
> > us with a full sample request and tell us how large a single documnent
> > is when returned? If you do not need all the fields in the returned
> > documents, you should limit them with the fl-parameter.
> >
> > - Toke Eskildsen, State and University Library, Denmark
> >
> >
> >
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Yonik Seeley
On Wed, Feb 10, 2016 at 5:21 AM, Markus Jelsma
 wrote:
> Hi - if we assume the following simple documents:
>
> <doc>
>   <field name="date">2015-01-01T00:00:00Z</field>
>   <field name="value">2</field>
> </doc>
> <doc>
>   <field name="date">2015-01-01T00:00:00Z</field>
>   <field name="value">4</field>
> </doc>
> <doc>
>   <field name="date">2015-01-02T00:00:00Z</field>
>   <field name="value">3</field>
> </doc>
> <doc>
>   <field name="date">2015-01-02T00:00:00Z</field>
>   <field name="value">7</field>
> </doc>
>
> Can i get a daily average for the field 'value' by day? e.g.
>
> <lst name="by_day">
>   <double name="2015-01-01">3.0</double>
>   <double name="2015-01-02">5.0</double>
> </lst>

For the JSON Facet API, I guess this would be:

json.facet=

by_day : {
  type : range,
  start : ...,
  end : ...,
  gap : "+1DAY",
  facet : {
x : "avg(value)"
  }
}


-Yonik
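
(For concreteness, a complete request of this shape might look as follows; the
collection name, field names and date range are assumptions, not taken from the
thread. Sending json.facet with --data-urlencode avoids the '+' in the gap
being decoded as a space:)

curl http://localhost:8983/solr/collection1/select \
 -d 'q=*:*' -d 'rows=0' -d 'wt=json' \
 --data-urlencode 'json.facet={
   by_day : {
     type  : range,
     field : date,
     start : "2015-01-01T00:00:00Z",
     end   : "2015-01-08T00:00:00Z",
     gap   : "+1DAY",
     facet : { x : "avg(value)" }
   }
 }'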


Re: Size of logs are high

2016-02-11 Thread Aditya Sundaram
Can you check your log level? Probably log level of error would suffice for
your purpose and it would most certainly reduce your log size(s).

On Thu, Feb 11, 2016 at 12:53 PM, kshitij tyagi  wrote:

> Hi,
> I have migrated to solr 5.2 and the size of logs are high.
>
> Can anyone help me out here how to control this?
>



-- 
Aditya Sundaram
Software Engineer, Technology team
AKR Tech park B Block, B1 047
+91-9844006866


Re: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Tom Evans
On Wed, Feb 10, 2016 at 12:13 PM, Markus Jelsma
 wrote:
> Hi Tom - thanks. But judging from the article and SOLR-6348 faceting stats 
> over ranges is not yet supported. More specifically, SOLR-6352 is what we 
> would need.
>
> [1]: https://issues.apache.org/jira/browse/SOLR-6348
> [2]: https://issues.apache.org/jira/browse/SOLR-6352
>
> Thanks anyway, at least we found the tickets :)
>

No problem - as I was reading this I was thinking "But wait, I *know*
we do this ourselves for average price vs month published". In fact, I
was forgetting that we index the ranges that we will want to facet
over as part of the document - so a document with a date_published of
"2010-03-29T00:00:00Z" also has a date_published.month of "201003"
(and a bunch of other ranges that we want to facet by). The frontend
then converts those fields in to the appropriate values for display.

This might be an acceptable solution for you guys too, depending on
how many ranges you require, and how much larger it would make
your index.

Cheers

Tom


Re: Solr architecture

2016-02-11 Thread Emir Arnautovic

Hi Mark,
Nothing comes for free :) With one doc per action, you will have to handle 
a large number of docs. There is a hard limit on the number of docs per shard - 
roughly 2 billion (the maximum value of a signed int) - so sharding is 
mandatory. It is most likely that you will have to have more than one 
collection. Depending on your queries, different layouts can be applied. 
What will these 320 qps be? Will you do some filtering (by user, country, ...), 
will you focus on the latest data, what is your data retention strategy...


You should answer such questions and decide on a setup that handles the 
important ones in an efficient way. With this amount of data you will most 
likely have to make some tradeoffs.


When it comes to sending docs to Solr, sending them in batches is mandatory.

Regards,
Emir

On 10.02.2016 22:48, Mark Robinson wrote:

Thanks everyone for your suggestions.
Based on it I am planning to have one doc per event with sessionId common.

So in this case hopefully indexing each doc as and when it comes would be
okay? Or do we still need to batch and index to Solr?

Also with 4M sessions a day with about 6000 docs (events) per session we
can expect about 24Billion docs per day!

Will Solr still hold good. If so could some one please recommend a sizing
to cater to this levels of data.
The queries per second is around 320 qps.

Thanks!
Mark


On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi Mark,
Appending session actions just to be able to return more than one session
without retrieving large number of results is not good tradeoff. Like
Upayavira suggested, you should consider storing one action per doc and
aggregate on read time or push to Solr once session ends and aggregate on
some other layer.
If you are thinking handling infrastructure might be too much, you may
consider using some of logging services to hold data. One such service is
Sematext's Logsene (http://sematext.com/logsene).

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 10.02.2016 03:22, Mark Robinson wrote:


Thanks for your replies and suggestions!

Why I store all events related to a session under one doc?
Each session can have about 500 total entries (events) corresponding to
it.
So when I try to retrieve a session's info it can back with around 500
records. If it is this compounded one doc per session, I can retrieve more
sessions at a time with one doc per session.
eg under a sessionId an array of eventA activities, eventB activities
   (using json). When an eventA activity again occurs, we will read all
that
data for that session, append this extra info to evenA data and push the
whole session related data back (indexing) to Solr. Like this for many
sessions parallely.


Why NRT?
Parallely many sessions are being written (4Million sessions hence
4Million
docs per day). A person can do this querying any time.

It is just a look up?
Yes. We just need to retrieve all info for a session and pass it on to
another system. We may even do some extra querying on some data like
timestamps, pageurl etc in that info added to a session.

Thinking of having the data separate from the actual Solr Instance and
mention the loc of the dataDir in solrconfig.

If Solr is not a good option could you please suggest something which will
satisfy this use case with min response time while querying.

Thanks!
Mark

On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins 
wrote:

So as I understand your use case, its effectively logging actions within a

user session, why do you have to do the update in NRT?  Why not just log
all the user session events (with some unique key, and ensuring the
session
Id is in the document somewhere), then when you want to do the query, you
join on the session id, and that gives you all the data records for that
session. I don't really follow why it has to be 1 document (which you
continually update). If you really need that aggregation, couldn't that
happen offline?

I guess your 1 saving grace is that you query using the unique ID (in
your
scenario) so you could use the real-time get handler, since you aren't
doing a complex query (strictly its not a search, its a raw key lookup).

But I would still question your use case, if you go the Solr route for
that
kind of scale with querying and indexing that much, you're going to have
to
throw a lot of hardware at it, as Jack says probably in the order of
hundreds of machines...

On 9 February 2016 at 19:00, Upayavira  wrote:

Bear in mind that Lucene is optimised towards high read lower write.

That is, it puts in a lot of effort at write time to make reading
efficient. It sounds like you are going to be doing far more writing
than reading, and I wonder whether you are necessarily choosing the
right tool for the job.

How would you later use this data, and what advantage is there to
storing it in Solr?

Upayavira

On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:


Hi,
Thanks 

Re: [More Like This] Query building

2016-02-11 Thread Alessandro Benedetti
Hi Guys,
is it possible to have any feedback?
Is there any process to speed up bug resolution / discussions?
I just want to understand whether the patch is not good enough and I need to
improve it, or whether simply no one has taken a look yet...

https://issues.apache.org/jira/browse/LUCENE-6954

Cheers

On 11 January 2016 at 15:25, Alessandro Benedetti 
wrote:

> Hi guys,
> the patch seems fine to me.
> I didn't spend much more time on the code but I checked the tests and the
> pre-commit checks.
> It seems fine to me.
> Let me know ,
>
> Cheers
>
> On 31 December 2015 at 18:40, Alessandro Benedetti 
> wrote:
>
>> https://issues.apache.org/jira/browse/LUCENE-6954
>>
>> First draft patch available, I will check better the tests new year !
>>
>> On 29 December 2015 at 13:43, Alessandro Benedetti > > wrote:
>>
>>> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>>>
>>> In the meantime let's try to collect some additional feedback.
>>>
>>> Cheers
>>>
>>> On 29 December 2015 at 12:43, Anshum Gupta 
>>> wrote:
>>>
 Feel free to create a JIRA and put up a patch if you can.

 On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
 abenede...@apache.org
 > wrote:

 > Hi guys,
 > While I was exploring the way we build the More Like This query, I
 > discovered a part I am not convinced of :
 >
 >
 >
 > Let's see how we build the query :
 > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
 >
 > 1) we extract the terms from the interesting fields, adding them to a
 map :
 >
 > Map termFreqMap = new HashMap<>();
 >
 > *( we lose the relation field-> term, we don't know anymore where the
 term
 > was coming ! )*
 >
 > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
 >
 > 2) we build the queue that will contain the query terms, at this
 point we
 > connect again there terms to some field, but :
 >
 > ...
 >> // go through all the fields and find the largest document frequency
 >> String topField = fieldNames[0];
 >> int docFreq = 0;
 >> for (String fieldName : fieldNames) {
 >>   int freq = ir.docFreq(new Term(fieldName, word));
 >>   topField = (freq > docFreq) ? fieldName : topField;
 >>   docFreq = (freq > docFreq) ? freq : docFreq;
 >> }
 >> ...
 >
 >
 > We identify the topField as the field with the highest document
 frequency
 > for the term t .
 > Then we build the termQuery :
 >
 > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
 >
 > In this way we lose a lot of precision.
 > Not sure why we do that.
 > I would prefer to keep the relation between terms and fields.
 > The MLT query can improve a lot the quality.
 > If i run the MLT on 2 fields : *description* and *facilities* for
 example.
 > It is likely I want to find documents with similar terms in the
 > description and similar terms in the facilities, without mixing up the
 > things and loosing the semantic of the terms.
 >
 > Let me know your opinion,
 >
 > Cheers
 >
 >
 > --
 > --
 >
 > Benedetti Alessandro
 > Visiting card : http://about.me/alessandro_benedetti
 >
 > "Tyger, tyger burning bright
 > In the forests of the night,
 > What immortal hand or eye
 > Could frame thy fearful symmetry?"
 >
 > William Blake - Songs of Experience -1794 England
 >



 --
 Anshum Gupta

>>>
>>>
>>>
>>> --
>>> --
>>>
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>>
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>>
>>> William Blake - Songs of Experience -1794 England
>>>
>>
>>
>>
>> --
>> --
>>
>> Benedetti Alessandro
>> Visiting card : http://about.me/alessandro_benedetti
>>
>> "Tyger, tyger burning bright
>> In the forests of the night,
>> What immortal hand or eye
>> Could frame thy fearful symmetry?"
>>
>> William Blake - Songs of Experience -1794 England
>>
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
Thanks Toke, yes, they are long times, and solr qtime (to execute the
query) is a fraction of a second.
The response in javabin format is around 300k.
Currently I can't limit the rows requested or the fields requested, those
are fixed for me.

2016-02-11 13:14 GMT+01:00 Toke Eskildsen :

> On Thu, 2016-02-11 at 11:53 +0100, Matteo Grolla wrote:
> >  I'm working with solr 4.0, sorting on score (default).
> > I tried setting the document cache size to 2048, so all docs of a single
> > request fit (2 requests fit actually)
> > If I execute a query the first time it takes 24s
> > I reexecute it, with all docs in the documentCache and it takes 15s
> > execute it with rows = 400 and it takes 3s
>
> Those are very long execution times. It sounds like you either have very
> complex queries or very large fields, as Binoy suggests. Can you provide
> us with a full sample request and tell us how large a single documnent
> is when returned? If you do not need all the fields in the returned
> documents, you should limit them with the fl-parameter.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Toke Eskildsen
On Thu, 2016-02-11 at 11:53 +0100, Matteo Grolla wrote:
>  I'm working with solr 4.0, sorting on score (default).
> I tried setting the document cache size to 2048, so all docs of a single
> request fit (2 requests fit actually)
> If I execute a query the first time it takes 24s
> I reexecute it, with all docs in the documentCache and it takes 15s
> execute it with rows = 400 and it takes 3s

Those are very long execution times. It sounds like you either have very
complex queries or very large fields, as Binoy suggests. Can you provide
us with a full sample request and tell us how large a single document
is when returned? If you do not need all the fields in the returned
documents, you should limit them with the fl-parameter.

- Toke Eskildsen, State and University Library, Denmark




Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
Hi Upayavira,
 I'm working with solr 4.0, sorting on score (default).
I tried setting the document cache size to 2048, so all docs of a single
request fit (2 requests fit actually)
If I execute a query the first time, it takes 24s.
I re-execute it, with all docs in the documentCache, and it takes 15s.
Executing it with rows=400 takes 3s.

It seems that below rows=400 the times are acceptable; beyond that they get slow.
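
(For reference, this is the kind of solrconfig.xml setting the experiment above
corresponds to; the sizes are illustrative:)

<documentCache class="solr.LRUCache"
               size="2048"
               initialSize="2048"
               autowarmCount="0"/>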

2016-02-11 11:27 GMT+01:00 Upayavira :

>
>
> On Thu, Feb 11, 2016, at 09:33 AM, Matteo Grolla wrote:
> > Hi,
> >  I'm trying to optimize a solr application.
> > The bottleneck are queries that request 1000 rows to solr.
> > Unfortunately the application can't be modified at the moment, can you
> > suggest me what could be done on the solr side to increase the
> > performance?
> > The bottleneck is just on fetching the results, the query executes very
> > fast.
> > I suggested caching .fdx and .fdt files on the file system cache.
> > Anything else?
>
> The index files will automatically be cached in the OS disk cache
> without any intervention, so that can't be the issue.
>
> How are you sorting the results? Are you letting it calculate scores?
> 1000 rows shouldn't be particularly expensive, beyond the unavoidable
> network cost.
>
> Have you considered using the /export endpoint and the streaming API? I
> haven't used it myself, but it is intended for getting larger amounts of
> data out of a Solr index.
>
> Upayavira
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Upayavira


On Thu, Feb 11, 2016, at 09:33 AM, Matteo Grolla wrote:
> Hi,
>  I'm trying to optimize a solr application.
> The bottleneck are queries that request 1000 rows to solr.
> Unfortunately the application can't be modified at the moment, can you
> suggest me what could be done on the solr side to increase the
> performance?
> The bottleneck is just on fetching the results, the query executes very
> fast.
> I suggested caching .fdx and .fdt files on the file system cache.
> Anything else?

The index files will automatically be cached in the OS disk cache
without any intervention, so that can't be the issue.

How are you sorting the results? Are you letting it calculate scores?
1000 rows shouldn't be particularly expensive, beyond the unavoidable
network cost.

Have you considered using the /export endpoint and the streaming API? I
haven't used it myself, but it is intended for getting larger amounts of
data out of a Solr index.

Upayavira
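
(A sketch of the /export endpoint for completeness; it needs docValues on the
sorted/returned fields and a newer release than the 4.0 used in this thread,
and the collection and field names are placeholders:)

# streams the full sorted result set instead of building one large page
curl 'http://localhost:8983/solr/collection1/export?q=*:*&sort=id+asc&fl=id,title'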


Are fieldCache and/or DocValues used by Function Queries

2016-02-11 Thread Andrea Roggerone
Hi,
I need to evaluate the performance of different boost solutions and I can't find
any relevant documentation about it. Are fieldCache and/or DocValues used
by Function Queries?
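
(As an illustration of what the question is about: a function query over a
field reads per-document values, which come from docValues when the field has
them, otherwise from the fieldCache built by un-inverting the index. The field
and query below are hypothetical:)

<!-- schema.xml: numeric field used only in boost/function queries -->
<field name="popularity" type="int" indexed="true" stored="false" docValues="true"/>

# hypothetical additive boost with edismax
curl 'http://localhost:8983/solr/collection1/select?q=foo&defType=edismax&bf=log(sum(popularity,1))'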


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Binoy Dalal
If you're fetching large text fields, consider highlighting on them and
just returning the snippets. I faced such a problem some time ago and
highlighting sped things up nearly 10x for us.
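
(A minimal sketch, with placeholder collection and field names:)

# return only short snippets of the large text field instead of its full stored value
curl 'http://localhost:8983/solr/collection1/select?q=body:solr&fl=id,score&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=150&wt=json'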

On Thu, 11 Feb 2016, 15:03 Matteo Grolla  wrote:

> Hi,
>  I'm trying to optimize a solr application.
> The bottleneck are queries that request 1000 rows to solr.
> Unfortunately the application can't be modified at the moment, can you
> suggest me what could be done on the solr side to increase the performance?
> The bottleneck is just on fetching the results, the query executes very
> fast.
> I suggested caching .fdx and .fdt files on the file system cache.
> Anything else?
>
> Thanks
>
-- 
Regards,
Binoy Dalal


optimize requests that fetch 1000 rows

2016-02-11 Thread Matteo Grolla
Hi,
 I'm trying to optimize a solr application.
The bottleneck are queries that request 1000 rows to solr.
Unfortunately the application can't be modified at the moment, can you
suggest me what could be done on the solr side to increase the performance?
The bottleneck is just on fetching the results, the query executes very
fast.
I suggested caching .fdx and .fdt files on the file system cache.
Anything else?

Thanks


Re: Running Solr on port 80

2016-02-11 Thread Binoy Dalal
The script essentially automates what you would do manually, for the first
time when starting up the system.
It is no different from extracting the archive, setting permissions etc.
yourself.
So the next time you wanted to stop/ restart solr, you'll have to do it
using the solr script.

That being said, I see that you've included a -f option along with your
command. Is that a typo? The script file doesn't have a -f option.

On Thu, 11 Feb 2016, 13:09 Jeyaprakash Singarayar 
wrote:

> That ok if I'm using it in local, but I'm doing it in a production based
> on the below page
>
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
>
>
>
> On Thu, Feb 11, 2016 at 12:58 PM, Binoy Dalal 
> wrote:
>
>> Why don't you directly run solr from the script provided in
>> {SOLR_DIST}\bin
>> ./solr start -p 8984
>>
>> On Thu, 11 Feb 2016, 12:56 Jeyaprakash Singarayar > >
>> wrote:
>>
>> > Hi,
>> >
>> > I'm trying to install solr 5.4.1 on CentOS. I know that while installing
>> > Solr as a service in the Linux we can pass -p <port number> to shift the
>> > app to host on that port.
>> >
>> >  ./install_solr_service.sh solr-5.4.1.tgz -p 8984 -f
>> >
>> > but still it shows as it is hosted on 8983 and not on 8984. Any idea?
>> >
>> > Waiting up to 30 seconds to see Solr running on port 8983 [/]
>> > Started Solr server on port 8983 (pid=33034). Happy searching!
>> >
>> > Found 1 Solr nodes:
>> >
>> > Solr process 33034 running on port 8983
>> > {
>> >   "solr_home":"/var/solr/data",
>> >   "version":"5.4.1 1725212 - jpountz - 2016-01-18 11:51:45",
>> >   "startTime":"2016-02-11T07:25:03.996Z",
>> >   "uptime":"0 days, 0 hours, 0 minutes, 11 seconds",
>> >   "memory":"68 MB (%13.9) of 490.7 MB"}
>> >
>> > Service solr installed.
>> >
>> --
>> Regards,
>> Binoy Dalal
>>
>
> --
Regards,
Binoy Dalal


both way synonyms with ManagedSynonymFilterFactory

2016-02-11 Thread Bjørn Hjelle
Hi,

one-way managed synonyms seem to work fine, but I cannot make both-way
synonyms work.

Steps to reproduce with Solr 5.4.1:

1. create a core:
$ bin/solr create_core -c test -d server/solr/configsets/basic_configs

2. edit schema.xml so fieldType text_general looks like this:

   [fieldType XML not preserved in the archive: the text_general analyzer
   chain includes a solr.ManagedSynonymFilterFactory with managed="english"]

3. reload the core:

$ curl -X GET "
http://localhost:8983/solr/admin/cores?action=RELOAD&core=test";

4. add synonyms, one one-way synonym, one two-way, reload the core again:

$ curl -X PUT -H 'Content-type:application/json' --data-binary
'{"mad":["angry","upset"]}' "
http://localhost:8983/solr/test/schema/analysis/synonyms/english";
$ curl -X PUT -H 'Content-type:application/json' --data-binary
'["mb","megabytes"]' "
http://localhost:8983/solr/test/schema/analysis/synonyms/english";
 $ curl -X GET "
http://localhost:8983/solr/admin/cores?action=RELOAD&core=test";

5. list the synonyms:
{
  "responseHeader":{
"status":0,
"QTime":0},
  "synonymMappings":{
"initArgs":{"ignoreCase":false},
"initializedOn":"2016-02-11T09:00:50.354Z",
"managedMap":{
  "mad":["angry",
"upset"],
  "mb":["megabytes"],
  "megabytes":["mb"]}}}


6. add two documents:

$ bin/post -c test -type 'application/json' -d '[{"id" : "1", "title_t" :
"10 megabytes makes me mad" },{"id" : "2", "title_t" : "100 mb should be
sufficient" }]'
$ bin/post -c test -type 'application/json' -d '[{"id" : "2", "title_t" :
"100 mb should be sufficient" }]'

7. search for the documents:

- all these return the first document, so one-way synonyms work:
$ curl -X GET "
http://localhost:8983/solr/test/select?q=title_t:angry&indent=true";
$ curl -X GET "
http://localhost:8983/solr/test/select?q=title_t:upset&indent=true";
$ curl -X GET "
http://localhost:8983/solr/test/select?q=title_t:mad&indent=true";

- this only returns the document with "mb":

$ curl -X GET "
http://localhost:8983/solr/test/select?q=title_t:mb&indent=true";

- this only returns the document with "megabytes"

$ curl -X GET "
http://localhost:8983/solr/test/select?q=title_t:megabytes&indent=true";


Any input on how to make this work would be appreciated.

Thanks,
Bjørn
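
(A diagnostic sketch that may help here: the field analysis endpoint shows how
the managed synonyms are applied at index and query time. The core name and
field type are the ones from the reproduction above; the rest is an assumption:)

curl 'http://localhost:8983/solr/test/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=100+mb&analysis.query=megabytes&wt=json&indent=true'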