Re: Soft commit and reading data just after the commit

2016-12-19 Thread Ere Maijala

Hi,

so, the app already has a database connection because it updates the 
READ flag when the user clicks an entry, right? If you only need the 
flag for display purposes, it sounds like it would make sense to also 
fetch it directly from the database when displaying the listing. Of 
course if you also need to search for READ/UNREAD you need to index the 
change, but perhaps you could get away with it taking longer.


--Ere

20.12.2016, 4.12, Lasitha Wattaladeniya kirjoitti:

Hi Shawn,

Thanks for your well detailed explanation. Now I understand, I won't be
able to achieve the 100ms softcommit timeout with my hardware setup.
However let's say someone has a requirement as below (quoted from my
previous mail)

*Requirement *is,  we are showing a list of entries on a page. For each
user there's a read / unread flag.  The data for listing is fetched from
solr. And you can see the entry was previously read or not. So when a user
views an entry by clicking.  We are updating the database flag to READ and
use real time indexing to update solr index.  So when the user close the
full view of the entry and go back to entry listing page,  the data fetched
from solr should be updated to READ.

Can't we achieve a requirement as described above using solr ? (without
manipulating the previously fetched results list from solr, because at some
point we'll have to go back to search results from solr and at that time it
should be updated).

Regards,
Lasitha

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893
Blog : techreadme.blogspot.com

On Mon, Dec 19, 2016 at 6:37 PM, Shawn Heisey  wrote:


On 12/18/2016 7:09 PM, Lasitha Wattaladeniya wrote:

@eric : thanks for the lengthy reply. So let's say I increase the
autosoftcommit time out to may be 100 ms. In that case do I have to
wait much that time from client side before calling search ?. What's
the correct way of achieving this?


Some of the following is covered by the links you've already received.
Some of it may be new information.

Before you can see a change you've just made, you will need to wait for
the commit to be fired (in this case, the autoSoftCommit time) plus
however long it actually takes to complete the commit and open a new
searcher.  Opening the searcher is the expensive part.

What I typically recommend that people do is have the autoSoftCommit
time as long as they can stand, with 60-300 seconds as a "typical"
value.  That's a setting of 60000 to 300000.  What you are trying to
achieve is much faster, and much more difficult.

100 milliseconds will typically be far too small a value unless your
index is extremely small or your hardware is incredibly fast and has a
lot of memory.  With a value of 100, you'll want each of those soft
commits (which do open a new searcher) to take FAR less than 100
milliseconds to complete.  This kind of speed can be difficult to
achieve, especially if the index is large.

To have any hope of fast commit times, you will need to set
autowarmCount on all Solr caches to zero.  If you are indexing
frequently enough, you might even want to completely disable Solr's
internal caches, because they may be providing no benefit.

You will want to have enough extra memory that your operating system can
cache the vast majority (or even maybe all) of your index.

https://wiki.apache.org/solr/SolrPerformanceProblems

Some other info that's helpful for understanding why plenty of *spare*
memory (not allocated by programs) is necessary for good performance:

https://en.wikipedia.org/wiki/Page_cache
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

The reason in a nutshell:  Disks are EXTREMELY slow.  Memory is very fast.

Thanks,
Shawn






--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Soft commit and reading data just after the commit

2016-12-19 Thread Walter Underwood
You probably need a database instead of a search engine.

What requirement makes you want to do this with a search engine?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 19, 2016, at 6:34 PM, Lasitha Wattaladeniya  wrote:
> 
> Hi Hendrik,
> 
> Thanks for your input. Previously I was using the hard commit
> (SolrClient.commit()) but then I got some error when there are concurrent
> real time index requests from my app. The error was  "Exceeded limit of
> maxWarmingSearchers=2, try again later", then i changed the code to use
> only solrserver.add(docs) method and configured autoSoftCommit timeout and
> autoCommit timeout in solrConfig.
> 
> I think, i'll get the same error when I use the method you described
> (SolrClient.commit(String
> collection, boolean waitFlush, boolean waitSearcher, boolean softCommit)),
> Anyway i'll have a look at that method.
> 
> Best regards,
> Lasitha
> 
> Lasitha Wattaladeniya
> Software Engineer
> 
> Mobile : +6593896893 <+65%209389%206893>
> Blog : techreadme.blogspot.com
> 
> On Tue, Dec 20, 2016 at 3:31 AM, Hendrik Haddorp 
> wrote:
> 
>> Hi,
>> 
>> the SolrJ API has this method: SolrClient.commit(String collection,
>> boolean waitFlush, boolean waitSearcher, boolean softCommit).
>> My assumption so far was that when you set waitSearcher to true that the
>> method call only returns once a search would find the new data, which
>> sounds what you want. I used this already and it seemed to work just fine.
>> 
>> regards,
>> Hendrik
>> 
>> 
>> On 19.12.2016 04:09, Lasitha Wattaladeniya wrote:
>> 
>>> Hi all,
>>> 
>>> Thanks for your replies,
>>> 
>>> @dorian : the requirement is,  we are showing a list of entries on a page.
>>> For each user there's a read / unread flag.  The data for listing is
>>> fetched from solr. And you can see the entry was previously read or not.
>>> So
>>> when a user views an entry by clicking.  We are updating the database flag
>>> to READ and use real time indexing to update solr entry.  So when the user
>>> close the full view of the entry and go back to entry listing page,  the
>>> data fetched from solr should be updated to READ. That's the use case we
>>> are trying to fix.
>>> 
>>> @eric : thanks for the lengthy reply.  So let's say I increase the
>>> autosoftcommit time out to may be 100 ms.  In that case do I have to wait
>>> much that time from client side before calling search ?.  What's the
>>> correct way of achieving this?
>>> 
>>> Regards,
>>> Lasitha
>>> 
>>> On 18 Dec 2016 23:52, "Erick Erickson"  wrote:
>>> 
>>> 1 ms autocommit is far too frequent. And it's not
 helping you anyway.
 
 There is some lag between when a commit happens
 and when the docs are really available. The sequence is:
 1> commit (soft or hard-with-opensearcher=true doesn't matter).
 2> a new searcher is opened and autowarming starts
 3> until the new searcher is opened, queries continue to be served by
 the old searcher
 4> the new searcher is fully opened
 5> _new_ requests are served by the new searcher.
 6> the last request is finished by the old searcher and it's closed.
 
 So what's probably happening is that you send docs and then send a
 query and Solr is still in step <3>. You can look at your admin UI
 plugins/stats page or your log to see how long it takes for a
 searcher to open and adjust your expectations accordingly.
 
 If you want to fetch only the document (not try to get it by a
 search), Real Time Get is designed to ensure that you always get the
 most recent copy whether it's searchable or not.
 
 All that said, Solr wasn't designed for autocommits that are that
 frequent. That's why the documentation talks about _Near_ Real Time.
 You may need to adjust your expectations.
 
 Best,
 Erick
 
 On Sun, Dec 18, 2016 at 6:49 AM, Dorian Hoxha 
 wrote:
 
> There's a very high probability that you're using the wrong tool for the
> job if you need 1ms softCommit time. Especially when you always need it
> 
 (ex
 
> there are apps where you need commit-after-insert very rarely).
> 
> So explain what you're using it for ?
> 
> On Sun, Dec 18, 2016 at 3:38 PM, Lasitha Wattaladeniya <
> 
 watt...@gmail.com>
 
> wrote:
> 
> Hi Furkan,
>> 
>> Thanks for the links. I had read the first one but not the second one.
>> I
>> did read it after you sent. So in my current solrconfig.xml settings
>> 
> below
 
> are the configurations,
>> 
>> <autoSoftCommit>
>>   <maxTime>${solr.autoSoftCommit.maxTime:1}</maxTime>
>> </autoSoftCommit>
>> 
>> <autoCommit>
>>   <maxTime>15000</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>> 
>> The problem i'm facing is, just after adding the documents to solr
>> using
>> solrj, when I retrieve data from solr I 

Re: Soft commit and reading data just after the commit

2016-12-19 Thread Lasitha Wattaladeniya
Hi Hendrik,

Thanks for your input. Previously I was using the hard commit
(SolrClient.commit()) but then I got some error when there are concurrent
real time index requests from my app. The error was  "Exceeded limit of
maxWarmingSearchers=2, try again later", then i changed the code to use
only solrserver.add(docs) method and configured autoSoftCommit timeout and
autoCommit timeout in solrConfig.

I think, i'll get the same error when I use the method you described
(SolrClient.commit(String
collection, boolean waitFlush, boolean waitSearcher, boolean softCommit)),
Anyway i'll have a look at that method.

Best regards,
Lasitha

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893 <+65%209389%206893>
Blog : techreadme.blogspot.com

On Tue, Dec 20, 2016 at 3:31 AM, Hendrik Haddorp 
wrote:

> Hi,
>
> the SolrJ API has this method: SolrClient.commit(String collection,
> boolean waitFlush, boolean waitSearcher, boolean softCommit).
> My assumption so far was that when you set waitSearcher to true that the
> method call only returns once a search would find the new data, which
> sounds what you want. I used this already and it seemed to work just fine.
>
> regards,
> Hendrik
>
>
> On 19.12.2016 04:09, Lasitha Wattaladeniya wrote:
>
>> Hi all,
>>
>> Thanks for your replies,
>>
>> @dorian : the requirement is,  we are showing a list of entries on a page.
>> For each user there's a read / unread flag.  The data for listing is
>> fetched from solr. And you can see the entry was previously read or not.
>> So
>> when a user views an entry by clicking.  We are updating the database flag
>> to READ and use real time indexing to update solr entry.  So when the user
>> close the full view of the entry and go back to entry listing page,  the
>> data fetched from solr should be updated to READ. That's the use case we
>> are trying to fix.
>>
>> @eric : thanks for the lengthy reply.  So let's say I increase the
>> autosoftcommit time out to may be 100 ms.  In that case do I have to wait
>> much that time from client side before calling search ?.  What's the
>> correct way of achieving this?
>>
>> Regards,
>> Lasitha
>>
>> On 18 Dec 2016 23:52, "Erick Erickson"  wrote:
>>
>> 1 ms autocommit is far too frequent. And it's not
>>> helping you anyway.
>>>
>>> There is some lag between when a commit happens
>>> and when the docs are really available. The sequence is:
>>> 1> commit (soft or hard-with-opensearcher=true doesn't matter).
>>> 2> a new searcher is opened and autowarming starts
>>> 3> until the new searcher is opened, queries continue to be served by
>>> the old searcher
>>> 4> the new searcher is fully opened
>>> 5> _new_ requests are served by the new searcher.
>>> 6> the last request is finished by the old searcher and it's closed.
>>>
>>> So what's probably happening is that you send docs and then send a
>>> query and Solr is still in step <3>. You can look at your admin UI
>>> plugins/stats page or your log to see how long it takes for a
>>> searcher to open and adjust your expectations accordingly.
>>>
>>> If you want to fetch only the document (not try to get it by a
>>> search), Real Time Get is designed to ensure that you always get the
>>> most recent copy whether it's searchable or not.
>>>
>>> All that said, Solr wasn't designed for autocommits that are that
>>> frequent. That's why the documentation talks about _Near_ Real Time.
>>> You may need to adjust your expectations.
>>>
>>> Best,
>>> Erick
>>>
>>> On Sun, Dec 18, 2016 at 6:49 AM, Dorian Hoxha 
>>> wrote:
>>>
 There's a very high probability that you're using the wrong tool for the
 job if you need 1ms softCommit time. Especially when you always need it

>>> (ex
>>>
 there are apps where you need commit-after-insert very rarely).

 So explain what you're using it for ?

 On Sun, Dec 18, 2016 at 3:38 PM, Lasitha Wattaladeniya <

>>> watt...@gmail.com>
>>>
 wrote:

 Hi Furkan,
>
> Thanks for the links. I had read the first one but not the second one.
> I
> did read it after you sent. So in my current solrconfig.xml settings
>
 below
>>>
 are the configurations,
>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:1}</maxTime>
> </autoSoftCommit>
>
> <autoCommit>
>   <maxTime>15000</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> The problem i'm facing is, just after adding the documents to solr
> using
> solrj, when I retrieve data from solr I am not getting the updated
>
 results.
>>>
 This happens time to time. Most of the time I get the correct data but
>
 in
>>>
 some occasions I get wrong results. so as you suggest, what the best
> practice to use here ? , should I wait 1 mili second before calling for
> updated results ?
>
> Regards,
> Lasitha
>
> Lasitha Wattaladeniya
> Software Engineer
>
> Mobile : +6593896893
> Blog : techreadme.blogspot.com

Re: Soft commit and reading data just after the commit

2016-12-19 Thread Lasitha Wattaladeniya
Hi Shawn,

Thanks for your well detailed explanation. Now I understand, I won't be
able to achieve the 100ms softcommit timeout with my hardware setup.
However let's say someone has a requirement as below (quoted from my
previous mail)

*Requirement *is,  we are showing a list of entries on a page. For each
user there's a read / unread flag.  The data for listing is fetched from
solr. And you can see the entry was previously read or not. So when a user
views an entry by clicking.  We are updating the database flag to READ and
use real time indexing to update solr index.  So when the user close the
full view of the entry and go back to entry listing page,  the data fetched
from solr should be updated to READ.

Can't we achieve a requirement as described above using solr ? (without
manipulating the previously fetched results list from solr, because at some
point we'll have to go back to search results from solr and at that time it
should be updated).

Regards,
Lasitha

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893
Blog : techreadme.blogspot.com

On Mon, Dec 19, 2016 at 6:37 PM, Shawn Heisey  wrote:

> On 12/18/2016 7:09 PM, Lasitha Wattaladeniya wrote:
> > @eric : thanks for the lengthy reply. So let's say I increase the
> > autosoftcommit time out to may be 100 ms. In that case do I have to
> > wait much that time from client side before calling search ?. What's
> > the correct way of achieving this?
>
> Some of the following is covered by the links you've already received.
> Some of it may be new information.
>
> Before you can see a change you've just made, you will need to wait for
> the commit to be fired (in this case, the autoSoftCommit time) plus
> however long it actually takes to complete the commit and open a new
> searcher.  Opening the searcher is the expensive part.
>
> What I typically recommend that people do is have the autoSoftCommit
> time as long as they can stand, with 60-300 seconds as a "typical"
> value.  That's a setting of 60000 to 300000.  What you are trying to
> achieve is much faster, and much more difficult.
>
> 100 milliseconds will typically be far too small a value unless your
> index is extremely small or your hardware is incredibly fast and has a
> lot of memory.  With a value of 100, you'll want each of those soft
> commits (which do open a new searcher) to take FAR less than 100
> milliseconds to complete.  This kind of speed can be difficult to
> achieve, especially if the index is large.
>
> To have any hope of fast commit times, you will need to set
> autowarmCount on all Solr caches to zero.  If you are indexing
> frequently enough, you might even want to completely disable Solr's
> internal caches, because they may be providing no benefit.
>
> You will want to have enough extra memory that your operating system can
> cache the vast majority (or even maybe all) of your index.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> Some other info that's helpful for understanding why plenty of *spare*
> memory (not allocated by programs) is necessary for good performance:
>
> https://en.wikipedia.org/wiki/Page_cache
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> The reason in a nutshell:  Disks are EXTREMELY slow.  Memory is very fast.
>
> Thanks,
> Shawn
>
>


Re: Stop Solr Node (in distress)?

2016-12-19 Thread Erick Erickson
The first question is _why_ is your disk full? Older versions of Solr
could, for instance, accumulate solr console log files forever. If
that's the case, just stop the Solr instance on that node, remove the
Solr log files, fix the log4j.properties file to not append to console
forever and you're done.

If it's more complicated, I'd
1> Spin up a new Solr node with adequate disk space.
2> use the Collections API ADDREPLICA command to add new replicas to
this node. The indexes will be synchronized etc.
3> once <2> is complete and the newly added replicas are green, issue
a DELETEREPLICA on the replicas on the Solr node in distress.

You could issue the DELETEREPLICA  before creating a new Solr instance
on a new node, but doing it in the order I indicated has the fewest
chances for doing something awful like deleting the replica on the
good node by mistake without having a healthy replica in your
cluster
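
For reference, a rough SolrJ sketch of steps <2> and <3>, assuming the
CollectionAdminRequest factory helpers available in recent SolrJ (6.x)
releases; the collection, shard, node and replica names below are just
placeholders:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class MoveReplicaOffDistressedNode {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient("zkhost1:2181")) {
                // 1) Add a replica of shard1 on the new, healthy node.
                CollectionAdminRequest.AddReplica add =
                        CollectionAdminRequest.addReplicaToShard("mycollection", "shard1");
                add.setNode("newnode:8983_solr");
                add.process(client);

                // 2) Only after the new replica is active ("green" in the admin UI),
                //    delete the replica that lives on the distressed node.
                CollectionAdminRequest.DeleteReplica delete =
                        CollectionAdminRequest.deleteReplica("mycollection", "shard1", "core_node3");
                delete.process(client);
            }
        }
    }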

Best,
Erick

On Mon, Dec 19, 2016 at 5:56 PM, Brian Narsi  wrote:
> We have had a situation where Solr node was in distress due to hard drive
> being full and the queries became very slow. Since our Solr cluster has two
> nodes with indexes being fully available on both the nodes, we think that
> one good solution would be to just stop the Solr instance on a distressed
> node. That way the other node will continue to serve queries until the
> distressed node is fixed.
>
> Can folks please share their experiences in such a situation?
>
> What are some good ways to handle this? How/when should we detect that the
> host (linux) is in distress and then stop the solr node?
>
> Thanks,


Re: Solr on HDFS: Streaming API performance tuning

2016-12-19 Thread Joel Bernstein
I took another look at the stack trace and I'm pretty sure the issue is
with NULL values in one of the sort fields. The null pointer is occurring
during the comparison of sort values. See line 85 of:
https://github.com/apache/lucene-solr/blob/branch_5_5/solr/solrj/src/java/org/apache/solr/client/solrj/io/comp/FieldComparator.java

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Dec 19, 2016 at 4:43 PM, Chetas Joshi 
wrote:

> Hi Joel,
>
> I don't have any solr documents that have NULL values for the sort fields I
> use in my queries.
>
> Thanks!
>
> On Sun, Dec 18, 2016 at 12:56 PM, Joel Bernstein 
> wrote:
>
> > Ok, based on the stack trace I suspect one of your sort fields has NULL
> > values, which in the 5x branch could produce null pointers if a segment
> had
> > no values for a sort field. This is also fixed in the Solr 6x branch.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Sat, Dec 17, 2016 at 2:44 PM, Chetas Joshi 
> > wrote:
> >
> > > Here is the stack trace.
> > >
> > > java.lang.NullPointerException
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.FieldComparator$2.
> > > compare(FieldComparator.java:85)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.FieldComparator.
> > > compare(FieldComparator.java:92)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.FieldComparator.
> > > compare(FieldComparator.java:30)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.MultiComp.compare(
> > MultiComp.java:45)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.MultiComp.compare(
> > MultiComp.java:33)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > > TupleWrapper.compareTo(CloudSolrStream.java:396)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > > TupleWrapper.compareTo(CloudSolrStream.java:381)
> > >
> > > at java.util.TreeMap.put(TreeMap.java:560)
> > >
> > > at java.util.TreeSet.add(TreeSet.java:255)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> > > read(CloudSolrStream.java:366)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> > > read(CloudSolrStream.java:353)
> > >
> > > at
> > >
> > > *.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.
> > > scala:101)
> > >
> > > at java.lang.Thread.run(Thread.java:745)
> > >
> > > 16/11/17 13:04:31 *ERROR* SolrStreamResultIterator:missing exponent
> > > number:
> > > char=A,position=106596
> > > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> > >
> > > org.noggit.JSONParser$ParseException: missing exponent number:
> > > char=A,position=106596
> > > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> > >
> > > at org.noggit.JSONParser.err(JSONParser.java:356)
> > >
> > > at org.noggit.JSONParser.readExp(JSONParser.java:513)
> > >
> > > at org.noggit.JSONParser.readNumber(JSONParser.java:419)
> > >
> > > at org.noggit.JSONParser.next(JSONParser.java:845)
> > >
> > > at org.noggit.JSONParser.nextEvent(JSONParser.java:951)
> > >
> > > at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)
> > >
> > > at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)
> > >
> > > at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.JSONTupleStream.
> > > next(JSONTupleStream.java:84)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.SolrStream.read(
> > > SolrStream.java:147)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > TupleWrapper.next(
> > > CloudSolrStream.java:413)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> > > read(CloudSolrStream.java:365)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> > > read(CloudSolrStream.java:353)
> > >
> > >
> > > Thanks!
> > >
> > > On Fri, Dec 16, 2016 at 11:45 PM, Reth RM 
> wrote:
> > >
> > > > If you could provide the json parse exception stack trace, it might
> > help
> > > to
> > > > predict issue there.
> > > >
> > > >
> > > > On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi <
> chetas.jo...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Joel,
> > > > >
> > > > > The only NON alpha-numeric characters I have in my data are '+' and
> > > '/'.
> > > > I
> > > > > don't have any backslashes.
> > > > >
> > > > > If the special characters was the issue, I should get the JSON
> > parsing
> > > > > exceptions every time irrespective of the index size and
> irrespective
> > > of
> > > > > the available memory on the 

Re: Solr on HDFS: Streaming API performance tuning

2016-12-19 Thread Chetas Joshi
Hi Joel,

I don't have any solr documents that have NULL values for the sort fields I
use in my queries.

Thanks!

On Sun, Dec 18, 2016 at 12:56 PM, Joel Bernstein  wrote:

> Ok, based on the stack trace I suspect one of your sort fields has NULL
> values, which in the 5x branch could produce null pointers if a segment had
> no values for a sort field. This is also fixed in the Solr 6x branch.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sat, Dec 17, 2016 at 2:44 PM, Chetas Joshi 
> wrote:
>
> > Here is the stack trace.
> >
> > java.lang.NullPointerException
> >
> > at
> > org.apache.solr.client.solrj.io.comp.FieldComparator$2.
> > compare(FieldComparator.java:85)
> >
> > at
> > org.apache.solr.client.solrj.io.comp.FieldComparator.
> > compare(FieldComparator.java:92)
> >
> > at
> > org.apache.solr.client.solrj.io.comp.FieldComparator.
> > compare(FieldComparator.java:30)
> >
> > at
> > org.apache.solr.client.solrj.io.comp.MultiComp.compare(
> MultiComp.java:45)
> >
> > at
> > org.apache.solr.client.solrj.io.comp.MultiComp.compare(
> MultiComp.java:33)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > TupleWrapper.compareTo(CloudSolrStream.java:396)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > TupleWrapper.compareTo(CloudSolrStream.java:381)
> >
> > at java.util.TreeMap.put(TreeMap.java:560)
> >
> > at java.util.TreeSet.add(TreeSet.java:255)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> > read(CloudSolrStream.java:366)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> > read(CloudSolrStream.java:353)
> >
> > at
> >
> > *.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.
> > scala:101)
> >
> > at java.lang.Thread.run(Thread.java:745)
> >
> > 16/11/17 13:04:31 *ERROR* SolrStreamResultIterator:missing exponent
> > number:
> > char=A,position=106596
> > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> >
> > org.noggit.JSONParser$ParseException: missing exponent number:
> > char=A,position=106596
> > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> >
> > at org.noggit.JSONParser.err(JSONParser.java:356)
> >
> > at org.noggit.JSONParser.readExp(JSONParser.java:513)
> >
> > at org.noggit.JSONParser.readNumber(JSONParser.java:419)
> >
> > at org.noggit.JSONParser.next(JSONParser.java:845)
> >
> > at org.noggit.JSONParser.nextEvent(JSONParser.java:951)
> >
> > at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)
> >
> > at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)
> >
> > at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.JSONTupleStream.
> > next(JSONTupleStream.java:84)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.SolrStream.read(
> > SolrStream.java:147)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> TupleWrapper.next(
> > CloudSolrStream.java:413)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> > read(CloudSolrStream.java:365)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> > read(CloudSolrStream.java:353)
> >
> >
> > Thanks!
> >
> > On Fri, Dec 16, 2016 at 11:45 PM, Reth RM  wrote:
> >
> > > If you could provide the json parse exception stack trace, it might
> help
> > to
> > > predict issue there.
> > >
> > >
> > > On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi 
> > > wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > The only NON alpha-numeric characters I have in my data are '+' and
> > '/'.
> > > I
> > > > don't have any backslashes.
> > > >
> > > > If the special characters was the issue, I should get the JSON
> parsing
> > > > exceptions every time irrespective of the index size and irrespective
> > of
> > > > the available memory on the machine. That is not the case here. The
> > > > streaming API successfully returns all the documents when the index
> > size
> > > is
> > > > small and fits in the available memory. That's the reason I am
> > confused.
> > > >
> > > > Thanks!
> > > >
> > > > On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein 
> > > > wrote:
> > > >
> > > > > The Streaming API may have been throwing exceptions because the
> JSON
> > > > > special characters were not escaped. This was fixed in Solr 6.0.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi <
> > chetas.jo...@gmail.com>
> > > > > wrote:
> > > > >

Re: Stats component's percentiles are incorrect

2016-12-19 Thread John Blythe
very good point, walter. i think we could find some cool ways to leverage
this intelligence for our users after serving up the flattened version
based on the simple range that they're expecting to see. the clarity is
helpful in getting some creative ideas moving, so thanks.

best,

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, Dec 19, 2016 at 4:17 PM, Walter Underwood 
wrote:

> Percentiles are far more useful than that linear approximation. That is
> just slope and intercept, basically two numbers.
>
> With percentiles, I can answer the question “how fast is the search for
> 95% of my visitors?” With that linear interpolation, I don’t know anything
> about my customers.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Dec 19, 2016, at 12:51 PM, John Blythe  wrote:
> >
> > gotcha. yup, that was the back up plan so i think i'll go that route for
> > now.
> >
> > thanks for the info!
> >
> > best,
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | j...@curvolabs.com
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Mon, Dec 19, 2016 at 3:41 PM, Toke Eskildsen 
> > wrote:
> >
> >> John Blythe  wrote:
> >>> if the range is 0 to 100 then, for my current purposes, i don't care if
> >> the
> >>> vast majority of the values are 92, i would want 25%=>25, 50%=>50, and
> >>> 75%=>75. so is there an out-of-the-box way to get the percentiles to
> >>> correspond to the range itself rather than the concentration of
> distinct
> >>> values?
> >>
> >> Then it is not percentiles. And I don't know of any built-in function
> that
> >> returns them directly.
> >>
> >> But as you have the min and max, you can just do
> >> 25%: (max-min)*0.25+min
> >> 50%: (max-min)*0.5+min
> >> 75%: (max-min)*0.75+min
> >>
> >> But of course, that won't guarantee that you match the distinct values.
> If
> >> you want that, you'll have to iterate the list of distinct values (hope
> >> it's not too large) and pick out the nearest ones.
> >>
> >> - Toke Eskildsen
> >>
>
>


Re: Stats component's percentiles are incorrect

2016-12-19 Thread Walter Underwood
Percentiles are far more useful than that linear approximation. That is just 
slope and intercept, basically two numbers.

With percentiles, I can answer the question “how fast is the search for 95% of 
my visitors?” With that linear interpolation, I don’t know anything about my 
customers.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 19, 2016, at 12:51 PM, John Blythe  wrote:
> 
> gotcha. yup, that was the back up plan so i think i'll go that route for
> now.
> 
> thanks for the info!
> 
> best,
> 
> -- 
> *John Blythe*
> Product Manager & Lead Developer
> 
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
> 
> 58 Adams Ave
> Evansville, IN 47713
> 
> On Mon, Dec 19, 2016 at 3:41 PM, Toke Eskildsen 
> wrote:
> 
>> John Blythe  wrote:
>>> if the range is 0 to 100 then, for my current purposes, i don't care if
>> the
>>> vast majority of the values are 92, i would want 25%=>25, 50%=>50, and
>>> 75%=>75. so is there an out-of-the-box way to get the percentiles to
>>> correspond to the range itself rather than the concentration of distinct
>>> values?
>> 
>> Then it is not percentiles. And I don't know of any built-in function that
>> returns them directly.
>> 
>> But as you have the min and max, you can just do
>> 25%: (max-min)*0.25+min
>> 50%: (max-min)*0.5+min
>> 75%: (max-min)*0.75+min
>> 
>> But of course, that won't guarantee that you match the distinct values. If
>> you want that, you'll have to iterate the list of distinct values (hope
>> it's not too large) and pick out the nearest ones.
>> 
>> - Toke Eskildsen
>> 



Re: Stats component's percentiles are incorrect

2016-12-19 Thread John Blythe
gotcha. yup, that was the back up plan so i think i'll go that route for
now.

thanks for the info!

best,

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, Dec 19, 2016 at 3:41 PM, Toke Eskildsen 
wrote:

> John Blythe  wrote:
> > if the range is 0 to 100 then, for my current purposes, i don't care if
> the
> > vast majority of the values are 92, i would want 25%=>25, 50%=>50, and
> > 75%=>75. so is there an out-of-the-box way to get the percentiles to
> > correspond to the range itself rather than the concentration of distinct
> > values?
>
> Then it is not percentiles. And I don't know of any built-in function that
> returns them directly.
>
> But as you have the min and max, you can just do
> 25%: (max-min)*0.25+min
> 50%: (max-min)*0.5+min
> 75%: (max-min)*0.75+min
>
> But of course, that won't guarantee that you match the distinct values. If
> you want that, you'll have to iterate the list of distinct values (hope
> it's not too large) and pick out the nearest ones.
>
> - Toke Eskildsen
>


Re: Stats component's percentiles are incorrect

2016-12-19 Thread Toke Eskildsen
John Blythe  wrote:
> if the range is 0 to 100 then, for my current purposes, i don't care if the
> vast majority of the values are 92, i would want 25%=>25, 50%=>50, and
> 75%=>75. so is there an out-of-the-box way to get the percentiles to
> correspond to the range itself rather than the concentration of distinct
> values?

Then it is not percentiles. And I don't know of any built-in function that 
returns them directly.

But as you have the min and max, you can just do
25%: (max-min)*0.25+min
50%: (max-min)*0.5+min
75%: (max-min)*0.75+min

But of course, that won't guarantee that you match the distinct values. If you 
want that, you'll have to iterate the list of distinct values (hope it's not 
too large) and pick out the nearest ones.
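
A small self-contained sketch of that interpolation in Java (values taken
from the stats output earlier in this thread), including an optional snap
to the nearest distinct value:

    public class RangePoints {
        // Linear interpolation between min and max, as described above.
        static double point(double min, double max, double fraction) {
            return (max - min) * fraction + min;
        }

        // Optional: snap an interpolated value to the nearest distinct value.
        static double nearest(double value, double[] distinctValues) {
            double best = distinctValues[0];
            for (double v : distinctValues) {
                if (Math.abs(v - value) < Math.abs(best - value)) {
                    best = v;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            double min = 3900.0, max = 4413.0;
            double[] distinct = {3900.0, 3998.0, 4098.0, 4200.0, 4305.0, 4413.0};
            for (double f : new double[]{0.25, 0.5, 0.75}) {
                double p = point(min, max, f);
                System.out.printf("%.0f%% -> %.2f (nearest distinct: %.1f)%n",
                        f * 100, p, nearest(p, distinct));
            }
        }
    }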

- Toke Eskildsen


Re: Stats component's percentiles are incorrect

2016-12-19 Thread John Blythe
mm, i was afraid something like that might be the case.

if the range is 0 to 100 then, for my current purposes, i don't care if the
vast majority of the values are 92, i would want 25%=>25, 50%=>50, and
75%=>75. so is there an out-of-the-box way to get the percentiles to
correspond to the range itself rather than the concentration of distinct
values?

thanks for any continued insight here!

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Mon, Dec 19, 2016 at 3:12 PM, Toke Eskildsen 
wrote:

> John Blythe  wrote:
> > count: 102
> ...
> > countDistinct: 6
>
> 102 values, but only 6 distinct (aka unique): 3900, 3998, 4098, 4200, 4305
> and 4413.
>
> > percentiles:
> > 25% = 4305.0
> > 50% = 4413.0
> > 75% = 4413.0
>
> >   - the 50th and 75%  are the same value as the max
> >   - the 50th and 75th % are the same number as one another
>
> That is not a sign of an error. But it does tell us that at least half of
> your 102 values are 4413. Forgetting your 25% and your mean value for a
> moment, the 102 values could be
> 3900, 3998, 4098, 4200, 4305, 4413, 4413, 4413... (94 more repeats of
> 4413).
>
> - Toke Eskildsen
>


Re: Stats component's percentiles are incorrect

2016-12-19 Thread Toke Eskildsen
John Blythe  wrote:
> count: 102
...
> countDistinct: 6

102 values, but only 6 distinct (aka unique): 3900, 3998, 4098, 4200, 4305 and 
4413.

> percentiles:
> 25% = 4305.0
> 50% = 4413.0
> 75% = 4413.0

>   - the 50th and 75%  are the same value as the max
>   - the 50th and 75th % are the same number as one another

That is not a sign of an error. But it does tell us that at least half of your 
102 values are 4413. Forgetting your 25% and your mean value for a moment, the 
102 values could be
3900, 3998, 4098, 4200, 4305, 4413, 4413, 4413... (94 more repeats of 4413).

- Toke Eskildsen


How to clear the Collection Creation failed errors?

2016-12-19 Thread srinalluri
I tried to create two collections but both failed for known reasons. I
want to clear the errors shown in the attached screenshot. Since the
collections failed to be created, there is no point in deleting them.


 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-clear-the-Collection-Creation-failed-errors-tp4310446.html
Sent from the Solr - User mailing list archive at Nabble.com.


Stats component's percentiles are incorrect

2016-12-19 Thread John Blythe
hi, all.

i've begun recruiting solr stats for some nifty little insights for our
users' data. it seems to be running just fine in most cases, but i have
noticed that there is a fringe group of results that seem to have incorrect
data.

for instance, one query returns the following output;




min: 3900.0
max: 4413.0
count: 102
distinctValues: 3900.0, 3998.0, 4098.0, 4200.0, 4305.0, 4413.0
countDistinct: 6
sum: 439300.0
sumOfSquares: 1.89426839E9
mean: 4306.862745098039
stddev: 149.70552211193035
percentiles: 25% = 4305.0, 50% = 4413.0, 75% = 4413.0





couple things to notice:

   - the 50th and 75th percentiles are the same value as the max
   - the 50th and 75th percentiles are the same number as one another

is there another setting i'm overlooking in the docs, is this a known bug,
or am i simply off the mark in my calculations/expectations?

thanks for any info-


Re: Soft commit and reading data just after the commit

2016-12-19 Thread Hendrik Haddorp

Hi,

the SolrJ API has this method: SolrClient.commit(String collection, 
boolean waitFlush, boolean waitSearcher, boolean softCommit).
My assumption so far was that when you set waitSearcher to true, the 
method call only returns once a search would find the new data, which 
sounds like what you want. I used this already and it seemed to work just fine.
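
A minimal SolrJ sketch of that approach (the collection name, zkHost, id
and field name below are placeholders, not taken from this thread):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ReadFlagUpdater {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new CloudSolrClient("zkhost1:2181")) {
                ((CloudSolrClient) client).setDefaultCollection("entries");

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "entry-42");
                doc.addField("read_flag_s", "READ");
                client.add("entries", doc);

                // waitFlush=true, waitSearcher=true, softCommit=true:
                // the call should not return until the new (soft) searcher is open,
                // so a search issued right after this line should see the update.
                client.commit("entries", true, true, true);
            }
        }
    }

Note that an explicit commit per update can still trip the
maxWarmingSearchers limit under concurrent updates, as noted elsewhere in
this thread.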


regards,
Hendrik

On 19.12.2016 04:09, Lasitha Wattaladeniya wrote:

Hi all,

Thanks for your replies,

@dorian : the requirement is,  we are showing a list of entries on a page.
For each user there's a read / unread flag.  The data for listing is
fetched from solr. And you can see the entry was previously read or not. So
when a user views an entry by clicking.  We are updating the database flag
to READ and use real time indexing to update solr entry.  So when the user
close the full view of the entry and go back to entry listing page,  the
data fetched from solr should be updated to READ. That's the use case we
are trying to fix.

@eric : thanks for the lengthy reply.  So let's say I increase the
autosoftcommit time out to may be 100 ms.  In that case do I have to wait
much that time from client side before calling search ?.  What's the
correct way of achieving this?

Regards,
Lasitha

On 18 Dec 2016 23:52, "Erick Erickson"  wrote:


1 ms autocommit is far too frequent. And it's not
helping you anyway.

There is some lag between when a commit happens
and when the docs are really available. The sequence is:
1> commit (soft or hard-with-opensearcher=true doesn't matter).
2> a new searcher is opened and autowarming starts
3> until the new searcher is opened, queries continue to be served by
the old searcher
4> the new searcher is fully opened
5> _new_ requests are served by the new searcher.
6> the last request is finished by the old searcher and it's closed.

So what's probably happening is that you send docs and then send a
query and Solr is still in step <3>. You can look at your admin UI
plugins/stats page or your log to see how long it takes for a
searcher to open and adjust your expectations accordingly.

If you want to fetch only the document (not try to get it by a
search), Real Time Get is designed to ensure that you always get the
most recent copy whether it's searchable or not.
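
For a pure by-id lookup, a minimal SolrJ sketch of Real Time Get (the
collection, id and field name are placeholders); getById goes through the
/get handler, so it returns the latest version of the document even if it
is not yet searchable:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class RealTimeGetExample {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new CloudSolrClient("zkhost1:2181")) {
                // Fetches the newest copy of the document by id, pulling from
                // the update log if it has not yet been made searchable.
                SolrDocument doc = client.getById("entries", "entry-42");
                if (doc != null) {
                    System.out.println("read flag: " + doc.getFieldValue("read_flag_s"));
                }
            }
        }
    }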

All that said, Solr wasn't designed for autocommits that are that
frequent. That's why the documentation talks about _Near_ Real Time.
You may need to adjust your expectations.

Best,
Erick

On Sun, Dec 18, 2016 at 6:49 AM, Dorian Hoxha 
wrote:

There's a very high probability that you're using the wrong tool for the
job if you need 1ms softCommit time. Especially when you always need it

(ex

there are apps where you need commit-after-insert very rarely).

So explain what you're using it for ?

On Sun, Dec 18, 2016 at 3:38 PM, Lasitha Wattaladeniya <

watt...@gmail.com>

wrote:


Hi Furkan,

Thanks for the links. I had read the first one but not the second one. I
did read it after you sent. So in my current solrconfig.xml settings

below

are the configurations,


<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:1}</maxTime>
</autoSoftCommit>

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

The problem i'm facing is, just after adding the documents to solr using
solrj, when I retrieve data from solr I am not getting the updated

results.

This happens time to time. Most of the time I get the correct data but

in

some occasions I get wrong results. so as you suggest, what the best
practice to use here ? , should I wait 1 mili second before calling for
updated results ?

Regards,
Lasitha

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893
Blog : techreadme.blogspot.com

On Sun, Dec 18, 2016 at 8:46 PM, Furkan KAMACI 
wrote:


Hi Lasitha,

First of all, did you check these:

https://cwiki.apache.org/confluence/display/solr/Near+

Real+Time+Searching

https://lucidworks.com/blog/2013/08/23/understanding-
transaction-logs-softcommit-and-commit-in-sorlcloud/

after that, if you cannot adjust your configuration you can give more
information and we can find a solution.

Kind Regards,
Furkan KAMACI

On Sun, Dec 18, 2016 at 2:28 PM, Lasitha Wattaladeniya <

watt...@gmail.com>

wrote:


Hi furkan,

Thanks for your reply, it is generally a query heavy system. We are

using

realtime indexing for editing the available data

Regards,
Lasitha

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893 <+65%209389%206893>
Blog : techreadme.blogspot.com

On Sun, Dec 18, 2016 at 8:12 PM, Furkan KAMACI <

furkankam...@gmail.com>

wrote:


Hi Lasitha,

What is your indexing / querying requirements. Do you have an index
heavy/light  - query heavy/light system?

Kind Regards,
Furkan KAMACI

On Sun, Dec 18, 2016 at 11:35 AM, Lasitha Wattaladeniya <
watt...@gmail.com>
wrote:


Hello devs,

I'm here with another problem i'm facing. I'm trying to do a

commit

(soft

commit) through solrj and just after the commit, retrieve the data

from

solr (requirement is to get 

Re: Confusing debug=timing parameter

2016-12-19 Thread Walter Underwood
One other thing. How many results are being requested? That is, what is the 
“rows” parameter?

Time includes query time. It does not include networking time for sending 
10,000 huge results to the client.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 19, 2016, at 9:47 AM, Chris Hostetter  wrote:
> 
> 
> SG:
> 
> IIRC, when doing a distributed/cloud search, the timing info returned for 
> each stage is the *cumulative* time spent on that stage in all shards -- 
> so if you have 4 shards, the "process" time reported could be 4x as much 
> as the actual process time spent.
> 
> The QTime reported back (in a distributed query) is the "wall clock" time 
> spent by the single node that coordinated the response, *NOT* including 
> time spent writing the response back to the client.
> 
> 
> So let's imagine you have a single Solr node, and a client that says the 
> total time for a Solr query took 5 seconds, but the QTime reported by Solr 
> is 1 second -- that suggests that 4 seconds was spent in either some sort 
> of general network overhead between the client & Solr, or in time spent 
> reading docs from disk if (AND ONLY IF) Solr is running in single node 
> mode, where Solr defers disk reads as long as possible and does those 
> reads only as needed to write the response.  In this single-node setup, 
> you should *NEVER* see the cumulative debug=timing time exceed the QTime 
> (for any reason I can think of).
> 
> In a distributed query, with multiple Solr nodes, any discrepancy between 
> the QTime and the time reported by the client is going to be related to 
> network overhead (or client overhead, parsing the response) because Solr 
> isn't able to do any "local reads" of docs from disk when writing the 
> response.  Imagine in this situation the client reports 5 seconds, QTime 
> reports 1 second, and debug=timing reports 45 seconds.  That 45 seconds 
> isn't "real" wall-clock time, it's just 45 total seconds of time spent on 
> all the nodes in parallel, added up cumulatively.  The 5-second vs 
> 1-second discrepancy would be from the network communication overhead, or 
> client overhead in parsing the response before reporting "success".
> 
> does that make sense?
> 
> 
> 
> : Date: Sat, 17 Dec 2016 08:43:53 -0800
> : From: S G 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user@lucene.apache.org
> : Subject: Confusing debug=timing parameter
> : 
> : Hi,
> : 
> : I am using Solr 4.10 and its response time for the clients is not very good.
> : Even though the Solr's plugin/stats shows less than 200 milliseconds,
> : clients report several seconds in response time.
> : 
> : So I tried using debug-timing parameter from the Solr UI and this is what I
> : got.
> : Note how the QTime is 2978 while the time in debug-timing is 19320.
> : 
> : What does this mean?
> : How can Solr return a result in 3 seconds when time taken between two
> : points in the same path is 20 seconds ?
> : 
> : {
> :   "responseHeader": {
> : "status": 0,
> : "QTime": 2978,
> : "params": {
> :   "q": "*:*",
> :   "debug": "timing",
> :   "indent": "true",
> :   "wt": "json",
> :   "_": "1481992653008"
> : }
> :   },
> :   "response": {
> : "numFound": 1565135270,
> : "start": 0,
> : "maxScore": 1,
> : "docs": [
> :   
> : ]
> :   },
> :   "debug": {
> : "timing": {
> :   "time": 19320,
> :   "prepare": {
> : "time": 4,
> : "query": {
> :   "time": 3
> : },
> : "facet": {
> :   "time": 0
> : },
> : "mlt": {
> :   "time": 0
> : },
> : "highlight": {
> :   "time": 0
> : },
> : "stats": {
> :   "time": 0
> : },
> : "expand": {
> :   "time": 0
> : },
> : "debug": {
> :   "time": 0
> : }
> :   },
> :   "process": {
> : "time": 19315,
> : "query": {
> :   "time": 19309
> : },
> : "facet": {
> :   "time": 0
> : },
> : "mlt": {
> :   "time": 1
> : },
> : "highlight": {
> :   "time": 0
> : },
> : "stats": {
> :   "time": 0
> : },
> : "expand": {
> :   "time": 0
> : },
> : "debug": {
> :   "time": 5
> : }
> :   }
> : }
> :   }
> : }
> : 
> 
> -Hoss
> http://www.lucidworks.com/



Re: ttl on merge-time possible somehow ?

2016-12-19 Thread Chris Hostetter

: So, the other way this can be made better in my opinion is (if the
: optimization is not already there)
: Is to make the 'delete-query' on ttl-documents operation on translog to not
: be forced to fsync to disk (so still written to translog, but no fsync).
: When another index/delete happens, it will also fsync the translog of the
: previous 'delete ttl query'.
: If the server crashes, meaning we lost those deletes because the translog
: wasn't fsynced to disk, then a thread can run on startup to recheck
: ttl-deletes.
: This will make it so the delete-query comes "free" in disk-fsync on
: translog.
: Makes sense ?

All updates in Solr operate on both the in memory IndexWriter and the 
(Solr specific) transaction log, and only when a "hard commit" happens is 
the IndexWriter closed (causing segment files to fsync) ... the TTL code 
only does a "soft commit" which should not do any fsyncs on the index.



-Hoss
http://www.lucidworks.com/


Trying to figure out a solution for a problem

2016-12-19 Thread Shankar Krish
Hello,
I am trying to find a solution for a specific search context that is not 
working the way I expect it to work.
Let me explain in detail:
The setup
I have a Solr instance set up with data (about 3.5 million documents). The 
schema has been set up with searchable text fields etc.; one of the fields is 
called Title and contains the title of the document.

Search Context:
On finding that support for multi-word synonyms in Solr is problematic, we are 
using a query-based approach for dealing with synonyms.
For a given search term, the search process is:

1.   Identify the synonyms of the search term from a non-Solr source

2.   String together the search term and synonyms as the basis for Solr 
search (in the form "srch term" "synonym a" "synonym b")

3.   Influence the results with two boost factors:

a.   Using bq and the search text in the form bq= 
(srch+term+synonym+a+synonym+b)^10

b.   Using bf on a field bf=product(EarnFactor,0.15)

As you can see the idea behind the use of synonyms is to ensure that the user 
entered term and its variations are used in searching for matches; and the use 
of bq on the title is for the very same reason.
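
For reference, a minimal SolrJ sketch of the request described above
(assuming an edismax request, since bq and bf are used; the qf fields are
placeholders, and the bq clause simply spells out the title boost on the
term and its synonyms):

    import org.apache.solr.client.solrj.SolrQuery;

    public class SynonymBoostQuery {
        public static void main(String[] args) {
            // The search term plus its externally resolved synonyms, each quoted as a phrase.
            SolrQuery query = new SolrQuery("\"srch term\" \"synonym a\" \"synonym b\"");
            query.set("defType", "edismax");
            query.set("qf", "Title text");   // placeholder query fields
            // Boost documents whose Title contains the term or any of its synonyms.
            query.set("bq", "Title:(\"srch term\" OR \"synonym a\" OR \"synonym b\")^10");
            // Additive function boost on the EarnFactor field, as described above.
            query.set("bf", "product(EarnFactor,0.15)");
            System.out.println(query);  // prints the URL-encoded request parameters
        }
    }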

The results returned by Solr are not as expected, and based on experimentation 
the observation is that the bq boost is the reason for this; if the bq boost is 
taken away, the results look decent, but the weight that we are looking to 
assign to the term in the title is completely lost.
We then set bq with the user-entered term alone and the results seem decent, 
but the shortcoming is that titles with the equivalent terms are not in the 
mix.

While in the search context we are able to use/leverage the term-based search 
text (search text within quotes), the bq boost does not seem to work with terms, 
and that appears to be the source of our problem. I tried changing the search 
parameters so that individual bq's were used for the search term and its 
synonyms, but that did not change the results in any manner.

The desire is that we would like to boost the score of a document should any of 
the terms used in the search appear in the title.

Any help in this context would be great.

I don't believe I can claim this to be a bug, because the documentation for bq 
does not describe the capability from a term perspective. It could be an 
enhancement request however.

We are using Solr 4.8. We were compelled to switch back from 6.2 to 4.8 because 
of performance issues with Group Facet.

Thanks & Regards
Shankar Krish


Re: Confusing debug=timing parameter

2016-12-19 Thread Chris Hostetter

SG:

IIRC, when doing a distributed/cloud search, the timing info returned for 
each stage is the *cumulative* time spent on that stage in all shards -- 
so if you have 4 shards, the "process" time reported could be 4x as much 
as the actual process time spent.

The QTime reported back (in a distributed query) is the "wall clock" time 
spent by the single node that coordinated the response, *NOT* including 
time spent writing the response back to the client.


So let's imagine you have a single Solr node, and a client that says the 
total time for a Solr query took 5 seconds, but the QTime reported by Solr 
is 1 second -- that suggests that 4 seconds was spent in either some sort 
of general network overhead between the client & Solr, or in time spent 
reading docs from disk if (AND ONLY IF) Solr is running in single node 
mode, where Solr defers disk reads as long as possible and does those 
reads only as needed to write the response.  In this single-node setup, 
you should *NEVER* see the cumulative debug=timing time exceed the QTime 
(for any reason I can think of).

In a distributed query, with multiple Solr nodes, any discrepancy between 
the QTime and the time reported by the client is going to be related to 
network overhead (or client overhead, parsing the response) because Solr 
isn't able to do any "local reads" of docs from disk when writing the 
response.  Imagine in this situation the client reports 5 seconds, QTime 
reports 1 second, and debug=timing reports 45 seconds.  That 45 seconds 
isn't "real" wall-clock time, it's just 45 total seconds of time spent on 
all the nodes in parallel, added up cumulatively.  The 5-second vs 
1-second discrepancy would be from the network communication overhead, or 
client overhead in parsing the response before reporting "success".

does that make sense?



: Date: Sat, 17 Dec 2016 08:43:53 -0800
: From: S G 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Confusing debug=timing parameter
: 
: Hi,
: 
: I am using Solr 4.10 and its response time for the clients is not very good.
: Even though the Solr's plugin/stats shows less than 200 milliseconds,
: clients report several seconds in response time.
: 
: So I tried using debug-timing parameter from the Solr UI and this is what I
: got.
: Note how the QTime is 2978 while the time in debug-timing is 19320.
: 
: What does this mean?
: How can Solr return a result in 3 seconds when time taken between two
: points in the same path is 20 seconds ?
: 
: {
:   "responseHeader": {
: "status": 0,
: "QTime": 2978,
: "params": {
:   "q": "*:*",
:   "debug": "timing",
:   "indent": "true",
:   "wt": "json",
:   "_": "1481992653008"
: }
:   },
:   "response": {
: "numFound": 1565135270,
: "start": 0,
: "maxScore": 1,
: "docs": [
:   
: ]
:   },
:   "debug": {
: "timing": {
:   "time": 19320,
:   "prepare": {
: "time": 4,
: "query": {
:   "time": 3
: },
: "facet": {
:   "time": 0
: },
: "mlt": {
:   "time": 0
: },
: "highlight": {
:   "time": 0
: },
: "stats": {
:   "time": 0
: },
: "expand": {
:   "time": 0
: },
: "debug": {
:   "time": 0
: }
:   },
:   "process": {
: "time": 19315,
: "query": {
:   "time": 19309
: },
: "facet": {
:   "time": 0
: },
: "mlt": {
:   "time": 1
: },
: "highlight": {
:   "time": 0
: },
: "stats": {
:   "time": 0
: },
: "expand": {
:   "time": 0
: },
: "debug": {
:   "time": 5
: }
:   }
: }
:   }
: }
: 

-Hoss
http://www.lucidworks.com/


Re: Stable releases of Solr

2016-12-19 Thread Andrea Gazzarini

Hi Deepak,
the latest version is 6.3.0 and I guess it is the best one to pick. 
Keep in mind that 3.6.1 => 6.3.0 is definitely a big jump.


In general, I think once a version is made available, that means it is 
(hopefully) stable.


Best,
Andrea

On 16/12/16 08:10, Deepak Kumar Gupta wrote:

Hi,

I am planning to upgrade lucene version in my codebase from 3.6.1
What is the latest stable version to which I can upgrade it?
Is 6.3.X stable?

Thanks,
Deepak





DIH caching URLDataSource/XPath entity (not root)

2016-12-19 Thread Chantal Ackermann
Hi there,

my index is created from XML files that are downloaded on the fly.
This also includes downloading a mapping file that is used to resolve IDs in 
the main file (root entity) and map them onto names.

The basic functionality works - the supplier_name is set for each document.
However, the mapping file is downloaded with every iteration of the root 
entity. In order to avoid this and only have it downloaded once and the mapping 
cached, I have set the cacheKey and cacheLookup properties but the file is 
still requested over and over again.

Has someone worked with multiple different XMLs files with mappings loaded via 
different DIH entities? I’d appreciate any samples or hints.
Or maybe someone is able to spot the error in the following configuration?

(The custom DataSource is a subclass of URLDataSource and handles Basic Auth as 
well as decompression.)













Specifying field in Child as query field

2016-12-19 Thread Navin Kulkarni
Hi,

I plan to use nested document structure for our index and would like to
know how to pick field from child documents as query fields. Normally we do
this using "qf" parameter in solr query and specify search key word in
query field. I tried to mention child field using "qf" but this did not
return results.

-Navin


Re: Soft commit and reading data just after the commit

2016-12-19 Thread Shawn Heisey
On 12/18/2016 7:09 PM, Lasitha Wattaladeniya wrote:
> @eric : thanks for the lengthy reply. So let's say I increase the
> autosoftcommit time out to may be 100 ms. In that case do I have to
> wait much that time from client side before calling search ?. What's
> the correct way of achieving this? 

Some of the following is covered by the links you've already received. 
Some of it may be new information.

Before you can see a change you've just made, you will need to wait for
the commit to be fired (in this case, the autoSoftCommit time) plus
however long it actually takes to complete the commit and open a new
searcher.  Opening the searcher is the expensive part.

What I typically recommend that people do is have the autoSoftCommit
time as long as they can stand, with 60-300 seconds as a "typical"
value.  That's a setting of 60000 to 300000.  What you are trying to
achieve is much faster, and much more difficult.

100 milliseconds will typically be far too small a value unless your
index is extremely small or your hardware is incredibly fast and has a
lot of memory.  With a value of 100, you'll want each of those soft
commits (which do open a new searcher) to take FAR less than 100
milliseconds to complete.  This kind of speed can be difficult to
achieve, especially if the index is large.

To have any hope of fast commit times, you will need to set
autowarmCount on all Solr caches to zero.  If you are indexing
frequently enough, you might even want to completely disable Solr's
internal caches, because they may be providing no benefit.

You will want to have enough extra memory that your operating system can
cache the vast majority (or even maybe all) of your index.

https://wiki.apache.org/solr/SolrPerformanceProblems

Some other info that's helpful for understanding why plenty of *spare*
memory (not allocated by programs) is necessary for good performance:

https://en.wikipedia.org/wiki/Page_cache
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

The reason in a nutshell:  Disks are EXTREMELY slow.  Memory is very fast.

Thanks,
Shawn



Re: [ANN] InvisibleQueriesRequestHandler

2016-12-19 Thread Mikhail Khludnev
> It has an interesting failure mode. If the user misspells a word (about
> 10% of
> queries do), and the misspelling matches a misspelled document, then you
> are stuck. It will never show the correctly-spelled document.
>

FWIW (and I'm sorry for hijacking) I've faced this challenge too, and found
that DirectSpellchecker has a parameter for it: thresholdTokenFrequency; see
https://cwiki.apache.org/confluence/display/solr/Spell+Checking. It can
suppress low-frequency suggestions that are considered misspellings in the index.

On Mon, Dec 5, 2016 at 8:16 PM, Walter Underwood 
wrote:

> We used to run that way, with an exact search first, then a broad search if
> there were no results.
>
> It has an interesting failure mode. If the user misspells a word (about
> 10% of
> queries do), and the misspelling matches a misspelled document, then you
> are stuck. It will never show the correctly-spelled document.
>
> For the very popular book Campbell Biology, if you searched for “cambell”,
> it would show a book with Greek plays and one misspelled author. Oops.
>
> We integrated fuzzy search into edismax. With that, we get the popular
> book for misspelled queries.
>
> You can find that patch in SOLR-629. I first implemented it for Solr 1.3,
> and
> I’ve been updating it for years. Very useful, especially with the fast
> fuzzy
> introduced in 4.x.
>
> https://issues.apache.org/jira/browse/SOLR-629
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Dec 5, 2016, at 9:08 AM, Charlie Hull  wrote:
> >
> > On 05/12/2016 09:18, Andrea Gazzarini wrote:
> >> Hi guys,
> >> I developed this handler [1] while doing some work on a Magento ->  Solr
> >> project.
> >>
> >> If someone is interested (this is a post [2] where I briefly explain the
> >> goal), or wants to contribute with some idea / improvement, feel free to
> >> give me a shout or a feedback.
> >>
> >> Best,
> >> Andrea
> >>
> >> [1] https://github.com/agazzarini/invisible-queries-request-handler
> >> [2]
> >> https://andreagazzarini.blogspot.it/2016/12/composing-
> and-reusing-request-handlers.html
> >>
> > We like this idea: we've seen plenty of systems where it's hard to
> change what the container system using Solr is doing (e.g. Hybris,
> Drupal...) so to be able to run multiple searches in Solr itself is very
> useful. Nice one!
> >
> > Charlie
> >
> >
> > --
> > Charlie Hull
> > Flax - Open Source Enterprise Search
> >
> > tel/fax: +44 (0)8700 118334
> > mobile:  +44 (0)7767 825828
> > web: www.flax.co.uk
> >
>
>


-- 
Sincerely yours
Mikhail Khludnev