Re: Measuring SOLR performance

2013-07-31 Thread Roman Chyla
No, I haven't had time for that (and likely won't have any for the next few
weeks), but it is on the list - if it is a 25% improvement, it would
really be worth the change to G1.
Thanks,

roman


On Wed, Jul 31, 2013 at 1:00 PM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Did you also test indexing speed? With default G1GC settings we're seeing
 a slightly higher latency for queries than CMS. However, G1GC allows for
 much higher throughput than CMS when indexing. I haven't got the raw
 numbers here but it is roughly 45 minutes against 60 in favour of G1GC!

 Load is obviously higher with G1GC.


 -Original message-
  From:Roman Chyla roman.ch...@gmail.com
  Sent: Wednesday 31st July 2013 18:32
  To: solr-user@lucene.apache.org
  Subject: Re: Measuring SOLR performance
 
  I'll try to run it with the new parameters and let you know how it goes.
  I've rechecked details for the G1 (default) garbage collector run and I
 can
  confirm that 2 out of 3 runs were showing high max response times, in
 some
  cases even 10secs, but the customized G1 never - so definitely the
  parameters had effect because the max time for the customized G1 never
 went
   higher than 1.5secs (and that happened for 2 query classes only). Both the
  cms-custom and G1-custom are similar, the G1 seems to have higher values
 in
   the max fields, but that may be random. So, yes, now I am sure that the
   default G1 should be considered 'bad', and that these G1 parameters, even if
   they don't seem G1-specific, have a real effect.
  Thanks,
 
  roman
 
 
  On Tue, Jul 30, 2013 at 11:01 PM, Shawn Heisey s...@elyograg.org
 wrote:
 
   On 7/30/2013 6:59 PM, Roman Chyla wrote:
I have been wanting some tools for measuring performance of SOLR,
 similar
 to Mike McCandless' lucene benchmark.
   
so yet another monitor was born, is described here:
   
 http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/
   
I tested it on the problem of garbage collectors (see the blogs for
details) and so far I can't conclude whether highly customized G1 is
   better
than highly customized CMS, but I think interesting details can be
 seen
there.
   
Hope this helps someone, and of course, feel free to improve the
 tool and
share!
  
   I have a CMS config that's even more tuned than before, and it has made
   things MUCH better.  This new config is inspired by more info that I
 got
   on IRC:
  
   http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
  
   The G1 customizations in your blog post don't look like they are really
   G1-specific - they may be useful with CMS as well.  This statement also
   applies to some of the CMS parameters, so I would use those with G1 as
   well for any testing.
  
   UseNUMA looks interesting for machines that actually are NUMA.  All the
   information that I can find says it is only for the throughput
   (parallel) collector, so it's probably not doing anything for G1.
  
   The pause parameters you've got for G1 are targets only.  It will *try*
   to stick within those parameters, but if a collection requires more
 than
   50 milliseconds or has to happen more often than once a second, the
   collector will ignore what you have told it.
  
   Thanks,
   Shawn
  
  
 



Re: Measuring SOLR performance

2013-08-01 Thread Roman Chyla
Dmitry,
Can you post the entire invocation line?
roman


On Thu, Aug 1, 2013 at 7:46 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 When I try to run with -q
 /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries

 here is what is reported:
 Traceback (most recent call last):
   File solrjmeter.py, line 1390, in module
 main(sys.argv)
   File solrjmeter.py, line 1309, in main
 tests = find_tests(options)
   File solrjmeter.py, line 461, in find_tests
 with changed_dir(pattern):
   File /usr/lib/python2.7/contextlib.py, line 17, in __enter__
 return self.gen.next()
   File solrjmeter.py, line 229, in changed_dir
 os.chdir(new)
 OSError: [Errno 20] Not a directory:
 '/home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries'

 Best,

 Dmitry



 On Wed, Jul 31, 2013 at 7:21 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi Dmitry,
  probably mistake in the readme, try calling it with -q
  /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries
 
   as for the base_url, i was testing it on solr4.0, where it tries contacting
   /solr/admin/system - is it different for 4.3? I guess I should make it
   configurable (it already is, the endpoint is set at the check_options())
 
  thanks
 
  roman
 
 
  On Wed, Jul 31, 2013 at 10:01 AM, Dmitry Kan solrexp...@gmail.com
 wrote:
 
 Ok, got the error fixed by modifying the base solr url in solrjmeter.py
   (added core name after /solr part).
   Next error is:
  
   WARNING: no test name(s) supplied nor found in:
   ['/home/dmitry/projects/lab/solrjmeter/demo/queries/demo.queries']
  
   It is a 'slow start with new tool' symptom I guess.. :)
  
  
   On Wed, Jul 31, 2013 at 4:39 PM, Dmitry Kan solrexp...@gmail.com
  wrote:
  
   Hi Roman,
  
   What  version and config of SOLR does the tool expect?
  
   Tried to run, but got:
  
   **ERROR**
 File solrjmeter.py, line 1390, in module
   main(sys.argv)
 File solrjmeter.py, line 1296, in main
   check_prerequisities(options)
 File solrjmeter.py, line 351, in check_prerequisities
   error('Cannot contact: %s' % options.query_endpoint)
 File solrjmeter.py, line 66, in error
   traceback.print_stack()
   Cannot contact: http://localhost:8983/solr
  
  
   complains about URL, clicking which leads properly to the admin
 page...
   solr 4.3.1, 2 cores shard
  
   Dmitry
  
  
   On Wed, Jul 31, 2013 at 3:59 AM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
   Hello,
  
   I have been wanting some tools for measuring performance of SOLR,
  similar
    to Mike McCandless' lucene benchmark.
  
   so yet another monitor was born, is described here:
  
  http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/
  
   I tested it on the problem of garbage collectors (see the blogs for
   details) and so far I can't conclude whether highly customized G1 is
   better
   than highly customized CMS, but I think interesting details can be
 seen
   there.
  
   Hope this helps someone, and of course, feel free to improve the tool
  and
   share!
  
   roman
  
  
  
  
 



Re: Measuring SOLR performance

2013-08-01 Thread Roman Chyla
Hi Bernd,



On Thu, Aug 1, 2013 at 4:07 AM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 Yes, UseNUMA is only for the Parallel Scavenger garbage collector and only
 for Solaris 9 and higher and Linux kernel 2.6.19 and glibc 2.6.1.
 And it performs with 64-bit better than 32-bit.
 So no effects for G1.

 With standard applications CMS is very slightly better than G1 but
 when it comes to huge heaps with high fragmentation G1 is better than CMS.
 The documentation says, one benefit of G1 is if the application has
 more than 50% of the Java heap occupied with live data.


Could you rephrase this bit please? I don't understand it, but I think it
is an important concern.


So the first step is to size the heap so that about 3/4 of it is
 occupied with live data, and then go on comparing CMS against G1.


Thanks,

  roman




 Otherwise G1 and CMS are about same or as I said CMS might be slightly
 better.

 Also, either turn swap off or also record vmstat. This should make sure
 that during a garbage collection no other system activity, like moving
 JVM heap to swap in background, is distorting your measurements.


 Bernd


 Am 31.07.2013 05:01, schrieb Shawn Heisey:
  On 7/30/2013 6:59 PM, Roman Chyla wrote:
  I have been wanting some tools for measuring performance of SOLR,
 similar
  to Mike McCandless' lucene benchmark.
 
  so yet another monitor was born, is described here:
  http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/
 
  I tested it on the problem of garbage collectors (see the blogs for
  details) and so far I can't conclude whether highly customized G1 is
 better
  than highly customized CMS, but I think interesting details can be seen
  there.
 
  Hope this helps someone, and of course, feel free to improve the tool
 and
  share!
 
  I have a CMS config that's even more tuned than before, and it has made
  things MUCH better.  This new config is inspired by more info that I got
  on IRC:
 
  http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
 
  The G1 customizations in your blog post don't look like they are really
  G1-specific - they may be useful with CMS as well.  This statement also
  applies to some of the CMS parameters, so I would use those with G1 as
  well for any testing.
 
  UseNUMA looks interesting for machines that actually are NUMA.  All the
  information that I can find says it is only for the throughput
  (parallel) collector, so it's probably not doing anything for G1.
 
  The pause parameters you've got for G1 are targets only.  It will *try*
  to stick within those parameters, but if a collection requires more than
  50 milliseconds or has to happen more often than once a second, the
  collector will ignore what you have told it.
 
  Thanks,
  Shawn
 




Re: How to uncache a query to debug?

2013-08-01 Thread Roman Chyla
When you set your caches (in solrconfig.xml) to size=0, you are not using a
cache, so you can debug more easily.
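If you want to sanity-check that the cache is really out of the picture, a rough Python 2 sketch like this (host, port, the /select handler and the query are placeholders) fires the same query twice and compares the QTime Solr reports - with caches at size=0 the second run should not suddenly drop:

# Rough sketch: send the same query twice and compare the QTime Solr reports.
# Host, port, handler (/select) and the query itself are placeholders - adjust to your core.
import json
import urllib
import urllib2

def qtime(query):
    params = urllib.urlencode({'q': query, 'wt': 'json', 'rows': 0})
    response = json.load(urllib2.urlopen('http://localhost:8983/solr/select?' + params))
    return response['responseHeader']['QTime']

q = '*:*'  # put one of your slow queries here
print 'first run  QTime:', qtime(q)
print 'second run QTime:', qtime(q)  # with all caches at size=0 this should stay comparable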

roman


On Thu, Aug 1, 2013 at 1:12 PM, jimtronic jimtro...@gmail.com wrote:

 I have a query that runs slow occasionally. I'm having trouble debugging it
 because once it's cached, it runs fast -- under 10 ms. But throughout the
 day it occasionally takes up to 3 secs. It seems like it could be one of
 the
 following:

 1. My autoCommit (30 and openSearcher=false) and softAutoCommit (1)
 settings
 2. Something to do with distributed search -- There are three nodes, but
 only 1 shard each.
 3. Just a slow query that is getting blown out of cache periodically

 This is in Solr 4.2.

 I like that it runs fast when cached, but if it's going to be blown out
 quickly, then I'd really like to just optimize the query to run fast
 uncached.

 *Is there any way to run a query using no caching whatsoever?*

 The query changes, but has *:* for the q param and 4 fq parameters. It's
 also trying to do field collapsing.

 Jim




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-uncache-a-query-to-debug-tp4082010.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Measuring SOLR performance

2013-08-01 Thread Roman Chyla
Hi, here is a short post describing the results of the yesterday run with
added parameters as per Shawn's recommendation, have fun getting confused ;)

http://29min.wordpress.com/2013/08/01/measuring-solr-performance-ii/

roman


On Wed, Jul 31, 2013 at 12:32 PM, Roman Chyla roman.ch...@gmail.com wrote:

 I'll try to run it with the new parameters and let you know how it goes.
 I've rechecked details for the G1 (default) garbage collector run and I can
 confirm that 2 out of 3 runs were showing high max response times, in some
 cases even 10secs, but the customized G1 never - so definitely the
 parameters had effect because the max time for the customized G1 never went
  higher than 1.5secs (and that happened for 2 query classes only). Both the
 cms-custom and G1-custom are similar, the G1 seems to have higher values in
  the max fields, but that may be random. So, yes, now I am sure that the
  default G1 should be considered 'bad', and that these G1 parameters, even if
  they don't seem G1-specific, have a real effect.
 Thanks,

 roman


 On Tue, Jul 30, 2013 at 11:01 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/30/2013 6:59 PM, Roman Chyla wrote:
  I have been wanting some tools for measuring performance of SOLR,
 similar
  to Mike McCandless' lucene benchmark.
 
  so yet another monitor was born, is described here:
  http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/
 
  I tested it on the problem of garbage collectors (see the blogs for
  details) and so far I can't conclude whether highly customized G1 is
 better
  than highly customized CMS, but I think interesting details can be seen
  there.
 
  Hope this helps someone, and of course, feel free to improve the tool
 and
  share!

 I have a CMS config that's even more tuned than before, and it has made
 things MUCH better.  This new config is inspired by more info that I got
 on IRC:

 http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

 The G1 customizations in your blog post don't look like they are really
 G1-specific - they may be useful with CMS as well.  This statement also
 applies to some of the CMS parameters, so I would use those with G1 as
 well for any testing.

 UseNUMA looks interesting for machines that actually are NUMA.  All the
 information that I can find says it is only for the throughput
 (parallel) collector, so it's probably not doing anything for G1.

 The pause parameters you've got for G1 are targets only.  It will *try*
 to stick within those parameters, but if a collection requires more than
 50 milliseconds or has to happen more often than once a second, the
 collector will ignore what you have told it.

 Thanks,
 Shawn





Re: Measuring SOLR performance

2013-08-01 Thread Roman Chyla
On Thu, Aug 1, 2013 at 6:11 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/1/2013 2:08 PM, Roman Chyla wrote:

 Hi, here is a short post describing the results of the yesterday run with
 added parameters as per Shawn's recommendation, have fun getting confused
 ;)

 http://29min.wordpress.com/2013/08/01/measuring-solr-performance-ii/


 I am having a very difficult time with the graphs.  I have no idea what
 I'm looking at.  The graphs are probably self-explanatory to you, because
 you created them and you've been staring at them for hours. There are both
 lines and shaded areas, and I can't tell what they mean.


I know :) but I am rather investing time in preparing a better test,
because as you said, worst case is the aim - and I would like to trigger
the worst case (btw, all these remaining GC configs have comparable max
execution time of less than 1.5s - that is the worst case in their case so
far and with so a few measurements, there is no meaningful analysis of
significance between them). But when I look at the heights of the areas, in
the charts, the higher means worse - so the yellow seems to be the worst
(g1-custom), your preferred configuration (i think it was cms-x1, green)
seems better than g1-custom. But 'SEEMS' is an important qualifier here



 Tables with numbers, if they have a good legend, would be awesome.


tables are there, just hidden, you would have to run it - the code is there
as well...



 One thing I'd like to see, and when I have some time of my own I will do
 some comprehensive long-term comparisons on production systems, is to see
 what adding or changing *one* GC tuning parameter at a time does, so I can
 find the ideal settings and have some idea of which settings make the most
 difference.

 My concern with garbage collection tuning has been mostly worst-case
 scenario pauses.  I certainly do want averages to come down, but it's
 really the worst-case that concerns me.

 Let's say that one of my typical queries takes 100 milliseconds on average
 with my GC config.  Somebody comes up with another GC config that makes the
 same query take 25 milliseconds or less on average.  If that config also
 results in rare stop-the-world garbage collections that take 5 full
 seconds, I won't be using it.  I'd rather deal with the slower average
 queries than the GC pause problems.


exactly



 I had to let my production systems run for days with jHiccup before I
 really noticed that I had a GC pause problem.  I've since learned that if I
 look at GC logs with GCLogViewer, I can get much the same information.


well, and instead of days, I think it is possible to trigger the worst case
scenario in a matter of hours (but that is my conjecture, to be proven
wrong... ;))

roman



 Thanks,
 Shawn




Re: Measuring SOLR performance

2013-08-02 Thread Roman Chyla
Hi Dmitry,

Thanks, it was a teething problem, fixed now, please try the fresh checkout
AND add the following to your arguments: -t /solr/core1

that sets the path under which solr should be contacted; the handler is set
in the jmeter configuration, so if you are using a query handler other
than /select, it should be edited there (SolrQueryTest.jmx)

I hope it works this time, the script is trying to guess the admin page
(when one cannot be contacted - but if the new solr introduces some new
paths, i may be wrong - i am short on time to investigate deeper)


roman


On Fri, Aug 2, 2013 at 7:27 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 Sure:

 python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
 /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries -s localhost
 -p 8983 -a --durationInSecs 60 -R test

 This is vanilla install (git clone) except for one change that I had to do
 related to solr cores:

  git diff
 diff --git a/solrjmeter.py b/solrjmeter.py
 index d18145a..7a0d2af 100644
 --- a/solrjmeter.py
 +++ b/solrjmeter.py
 @@ -129,7 +129,7 @@ def check_options(options, args):
  if not options.serverName and not options.serverPort:
  error(You must specify both server and port)

 -options.query_endpoint = 'http://%s:%s/solr' % (options.serverName,
 options.serverPort)
 +options.query_endpoint = 'http://%s:%s/solr/core1' %
 (options.serverName, options.serverPort)

  jmx_options = []
  for k, v in options.__dict__.items():



 Dmitry


 On Thu, Aug 1, 2013 at 6:41 PM, Roman Chyla roman.ch...@gmail.com wrote:

  Dmitry,
  Can you post the entire invocation line?
  roman
 
 
  On Thu, Aug 1, 2013 at 7:46 AM, Dmitry Kan solrexp...@gmail.com wrote:
 
   Hi Roman,
  
   When I try to run with -q
   /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries
  
  here is what is reported:
   Traceback (most recent call last):
 File solrjmeter.py, line 1390, in module
   main(sys.argv)
 File solrjmeter.py, line 1309, in main
   tests = find_tests(options)
 File solrjmeter.py, line 461, in find_tests
   with changed_dir(pattern):
 File /usr/lib/python2.7/contextlib.py, line 17, in __enter__
   return self.gen.next()
 File solrjmeter.py, line 229, in changed_dir
   os.chdir(new)
   OSError: [Errno 20] Not a directory:
   '/home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries'
  
   Best,
  
   Dmitry
  
  
  
   On Wed, Jul 31, 2013 at 7:21 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Hi Dmitry,
probably mistake in the readme, try calling it with -q
/home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries
   
 as for the base_url, i was testing it on solr4.0, where it tries contacting
 /solr/admin/system - is it different for 4.3? I guess I should make it
 configurable (it already is, the endpoint is set at the check_options())
   
thanks
   
roman
   
   
On Wed, Jul 31, 2013 at 10:01 AM, Dmitry Kan solrexp...@gmail.com
   wrote:
   
 Ok, got the error fixed by modifying the base solr url in
  solrjmeter.py
 (added core name after /solr part).
 Next error is:

 WARNING: no test name(s) supplied nor found in:
 ['/home/dmitry/projects/lab/solrjmeter/demo/queries/demo.queries']

 It is a 'slow start with new tool' symptom I guess.. :)


 On Wed, Jul 31, 2013 at 4:39 PM, Dmitry Kan solrexp...@gmail.com
wrote:

 Hi Roman,

 What  version and config of SOLR does the tool expect?

 Tried to run, but got:

 **ERROR**
   File solrjmeter.py, line 1390, in module
 main(sys.argv)
   File solrjmeter.py, line 1296, in main
 check_prerequisities(options)
   File solrjmeter.py, line 351, in check_prerequisities
 error('Cannot contact: %s' % options.query_endpoint)
   File solrjmeter.py, line 66, in error
 traceback.print_stack()
 Cannot contact: http://localhost:8983/solr


 complains about URL, clicking which leads properly to the admin
   page...
 solr 4.3.1, 2 cores shard

 Dmitry


 On Wed, Jul 31, 2013 at 3:59 AM, Roman Chyla 
 roman.ch...@gmail.com
wrote:

 Hello,

 I have been wanting some tools for measuring performance of SOLR,
similar
 to Mike McCandless' lucene benchmark.

 so yet another monitor was born, is described here:

   
  http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/

 I tested it on the problem of garbage collectors (see the blogs
 for
 details) and so far I can't conclude whether highly customized G1
  is
 better
 than highly customized CMS, but I think interesting details can
 be
   seen
 there.

 Hope this helps someone, and of course, feel free to improve the
  tool
and
 share!

 roman




   
  
 



Re: Measuring SOLR performance

2013-08-05 Thread Roman Chyla
Hi Dmitry,
So I think the admin pages are different on your version of solr, what do
you see when you request... ?

http://localhost:8983/solr/admin/system?wt=json
http://localhost:8983/solr/admin/mbeans?wt=json
http://localhost:8983/solr/admin/cores?wt=json

If your core -t was '/solr/statements', the script should assume admin is
at: /solr/admin (the script checks for /admin/system/cores - so that url
already exists), thus I am guessing /admin/system is not there.
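If it helps to check quickly which of those admin endpoints your Solr actually answers on, a rough Python 2 probe along these lines (host, port and the 'statements' core name are assumptions taken from this thread) will show it:

# Sketch: probe the candidate Solr admin endpoints discussed above and print which respond.
# Host, port and the core name ('statements') are assumptions - adjust for your setup.
import json
import urllib2

base = 'http://localhost:8983'
candidates = [
    '/solr/admin/system',             # global admin (needs a default core)
    '/solr/admin/cores',              # global core management handler
    '/solr/statements/admin/system',  # core-level admin
]

for path in candidates:
    url = base + path + '?wt=json'
    try:
        json.load(urllib2.urlopen(url))
        print 'OK  ', url
    except Exception as e:
        print 'FAIL', url, e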

if you can, please check out the latest version - the script will print its
environment, the 'admin_endpoint' is the one that we are interested in.
I'll update the docs, btw. you may want to use '-e statements' to indicate
what core_name you want to harvest details for

thanks,

roman







On Mon, Aug 5, 2013 at 6:22 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 No problem. Still trying to launch the thing..

 The query with the added -t parameter generated an error:

 1. python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
 ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60 -R
 test -t /solr/statements   [passed relative path to -q param]

 (as you can see I added -t param and made -q param simpler)

 Traceback (most recent call last):
   File solrjmeter.py, line 1425, in module
 main(sys.argv)
   File solrjmeter.py, line 1379, in main
 before_test = harvest_details_about_montysolr(options)
   File solrjmeter.py, line 505, in harvest_details_about_montysolr
 system_data = req('%s/system' % options.admin_endpoint)
   File solrjmeter.py, line 113, in req
 raise r
 simplejson.decoder.JSONDecodeError: No JSON object could be decoded: line 1
 column 0 (char 0)


 The README.md on the github is somehow outdated, it suggests using -q
 ./demo/queries/demo.queries, but there is no such path in the fresh
 checkout.

 Nice to have the -t param.

 Dmitry


 On Sat, Aug 3, 2013 at 5:01 AM, Roman Chyla roman.ch...@gmail.com wrote:

  Hi Dmitry,
 
  Thanks, it was a teething problem, fixed now, please try the fresh
 checkout
  AND add the following to your arguments: -t /solr/core1
 
  that sets the path under which solr should be contacted, the handler is
 set
  in the jmeter configuration, so if you were using different query
 handlers
  than /select, it should be edited there (SolrQueryTest.jmx)
 
  I hope it works this time, the script is trying to guess the admin page
  (when one cannot be contacted - but if the new solr introduces some new
  paths, i may be wrong - i am short on time to investigate deeper)
 
 
  roman
 
 
  On Fri, Aug 2, 2013 at 7:27 AM, Dmitry Kan solrexp...@gmail.com wrote:
 
   Hi Roman,
  
   Sure:
  
   python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
   /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries -s
  localhost
   -p 8983 -a --durationInSecs 60 -R test
  
   This is vanilla install (git clone) except for one change that I had to
  do
   related to solr cores:
  
git diff
   diff --git a/solrjmeter.py b/solrjmeter.py
   index d18145a..7a0d2af 100644
   --- a/solrjmeter.py
   +++ b/solrjmeter.py
   @@ -129,7 +129,7 @@ def check_options(options, args):
if not options.serverName and not options.serverPort:
error(You must specify both server and port)
  
   -options.query_endpoint = 'http://%s:%s/solr' %
 (options.serverName,
   options.serverPort)
   +options.query_endpoint = 'http://%s:%s/solr/core1' %
   (options.serverName, options.serverPort)
  
jmx_options = []
for k, v in options.__dict__.items():
  
  
  
   Dmitry
  
  
   On Thu, Aug 1, 2013 at 6:41 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
Dmitry,
Can you post the entire invocation line?
roman
   
   
On Thu, Aug 1, 2013 at 7:46 AM, Dmitry Kan solrexp...@gmail.com
  wrote:
   
 Hi Roman,

 When I try to run with -q
 /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries

 here is what is reported:
 Traceback (most recent call last):
   File solrjmeter.py, line 1390, in module
 main(sys.argv)
   File solrjmeter.py, line 1309, in main
 tests = find_tests(options)
   File solrjmeter.py, line 461, in find_tests
 with changed_dir(pattern):
   File /usr/lib/python2.7/contextlib.py, line 17, in __enter__
 return self.gen.next()
   File solrjmeter.py, line 229, in changed_dir
 os.chdir(new)
 OSError: [Errno 20] Not a directory:
 '/home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries'

 Best,

 Dmitry



 On Wed, Jul 31, 2013 at 7:21 PM, Roman Chyla 
 roman.ch...@gmail.com
 wrote:

  Hi Dmitry,
  probably mistake in the readme, try calling it with -q
  /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries
 
   as for the base_url, i was testing it on solr4.0, where it tries contacting
   /solr/admin/system - is it different for 4.3? I guess I should

Re: Measuring SOLR performance

2013-08-06 Thread Roman Chyla
Hi Dmitry,

I've modified solrjmeter to retrieve data from under the core (the -t
parameter) and the rest from /solr/admin - I could only test it against
4.0, but it seems to be the same there as in 4.3... so you can try the fresh
checkout

my test was: python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -t
/solr/collection1 -R foo -q ./queries/demo/* -p 9002 -s adsate

Thanks!

roman


On Tue, Aug 6, 2013 at 9:46 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi,

 Thanks for the clarification, Shawn!

 So with this in mind, the following work:

 http://localhost:8983/solr/statements/admin/system?wt=json
 http://localhost:8983/solr/statements/admin/mbeans?wt=json

 not copying their output to save space.

 Roman:

 is this something that should be set via -t parameter as well?

 Dmitry



 On Tue, Aug 6, 2013 at 4:34 PM, Shawn Heisey s...@elyograg.org wrote:

  On 8/6/2013 6:17 AM, Dmitry Kan wrote:
   Of three URLs you asked for, only the 3rd one gave response:
  snip
   The rest report 404.
  
   On Mon, Aug 5, 2013 at 8:38 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
   Hi Dmitry,
   So I think the admin pages are different on your version of solr, what
  do
   you see when you request... ?
  
   http://localhost:8983/solr/admin/system?wt=json
   http://localhost:8983/solr/admin/mbeans?wt=json
   http://localhost:8983/solr/admin/cores?wt=json
 
  Unless you have a valid defaultCoreName set in your (old-style)
  solr.xml, the first two URLs won't work, as you've discovered.  Without
  that valid defaultCoreName (or if you wanted info from a different
  core), you'd need to add a core name to the URL for them to work.
 
  The third one, which works for you, is a global handler for manipulating
  cores, so naturally it doesn't need a core name to function.  The URL
  path for this handler is defined by solr.xml.
 
  Thanks,
  Shawn
 
 



Re: Measuring SOLR performance

2013-08-07 Thread Roman Chyla
Hi Dmitry,
The command seems good. Are you sure your shell is not doing something
funny with the params? You could try:

python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx -a

where g1 and foo are results of the individual runs, ie. something that was
started and saved with '-R g1' and '-R foo' respectively

so, for example, i have these comparisons inside
'/var/lib/montysolr/different-java-settings/solrjmeter', so I am generating
the comparison by:

export SOLRJMETER_HOME=/var/lib/montysolr/different-java-settings/solrjmeter
python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx -a


roman


On Wed, Aug 7, 2013 at 10:03 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 One more question. I tried to compare different runs (g1 vs cms) using the
 command below, but get an error. Should I attach some other param(s)?


 python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx
 **ERROR**
   File solrjmeter.py, line 1427, in module
 main(sys.argv)
   File solrjmeter.py, line 1303, in main
 check_options(options, args)
   File solrjmeter.py, line 185, in check_options
 error(The folder '%s' does not exist % rf)
   File solrjmeter.py, line 66, in error
 traceback.print_stack()
 The folder '0' does not exist

 Dmitry




 On Wed, Aug 7, 2013 at 4:13 PM, Dmitry Kan solrexp...@gmail.com wrote:

  Hi Roman,
 
  Finally, this has worked! Thanks for quick support.
 
  The graphs look awesome. At least on the index sample :) It is quite easy
  to set up and run + possible to run directly on the shard server in
  background mode.
 
  my test run was:
 
  python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
  ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60
 -R
  foo -t /solr/statements -e statements
 
  Thanks!
 
  Dmitry
 
 
  On Wed, Aug 7, 2013 at 6:54 AM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
  Hi Dmitry,
 
  I've modified the solrjmeter to retrieve data from under the core (the
 -t
  parameter) and the rest from the /solr/admin - I could test it only
  against
  4.0, but it is there the same as 4.3 - it seems...so you can try the
 fresh
  checkout
 
  my test was: python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -t
  /solr/collection1 -R foo -q ./queries/demo/* -p 9002 -s adsate
 
  Thanks!
 
  roman
 
 
  On Tue, Aug 6, 2013 at 9:46 AM, Dmitry Kan solrexp...@gmail.com
 wrote:
 
   Hi,
  
   Thanks for the clarification, Shawn!
  
   So with this in mind, the following work:
  
   http://localhost:8983/solr/statements/admin/system?wt=json
   http://localhost:8983/solr/statements/admin/mbeans?wt=json
  
   not copying their output to save space.
  
   Roman:
  
   is this something that should be set via -t parameter as well?
  
   Dmitry
  
  
  
   On Tue, Aug 6, 2013 at 4:34 PM, Shawn Heisey s...@elyograg.org
 wrote:
  
On 8/6/2013 6:17 AM, Dmitry Kan wrote:
 Of three URLs you asked for, only the 3rd one gave response:
snip
 The rest report 404.

 On Mon, Aug 5, 2013 at 8:38 PM, Roman Chyla 
 roman.ch...@gmail.com
wrote:

 Hi Dmitry,
 So I think the admin pages are different on your version of solr,
  what
do
 you see when you request... ?

 http://localhost:8983/solr/admin/system?wt=json
 http://localhost:8983/solr/admin/mbeans?wt=json
 http://localhost:8983/solr/admin/cores?wt=json
   
Unless you have a valid defaultCoreName set in your (old-style)
solr.xml, the first two URLs won't work, as you've discovered.
   Without
that valid defaultCoreName (or if you wanted info from a different
core), you'd need to add a core name to the URL for them to work.
   
The third one, which works for you, is a global handler for
  manipulating
cores, so naturally it doesn't need a core name to function.  The
 URL
path for this handler is defined by solr.xml.
   
Thanks,
Shawn
   
   
  
 
 
 



Re: Percolate feature?

2013-08-09 Thread Roman Chyla
On Fri, Aug 9, 2013 at 11:29 AM, Mark static.void@gmail.com wrote:

  *All* of the terms in the field must be matched by the query... not
 vice-versa.

 Exactly. This is why I was trying to explain it as a reverse search.

 I just realized I described it as a *large* list of known keywords when
 really it's small; no more than 1000. Forgetting about performance, how hard
 do you think this would be to implement? How should I even start?


not hard, index all terms into a field - make sure there are no duplicates,
as you want to count them - then I can imagine at least two options: save
the number of terms as a payload together with the terms, or in a second step
(in a collector, for example), load the document and count the terms in
the field - if that count matches the query size, you are done

a trivial, naive implementation (as you say 'forget performance') could be:

searcher.search(query, null, new Collector() {
  ...
  @Override
  public void collect(int i) throws IOException {
     // load just the field holding the indexed terms
     Document d = reader.document(i, fieldsToLoad);
     // all terms must match: number of stored values == number of query clauses
     // (query.clauses() assumes the query is a BooleanQuery of term clauses)
     if (d.getValues(fieldToLoad).length == query.clauses().size()) {
        priorityQueue.add(new ScoreDoc(score, i + docBase));
     }
  }
});

so if your query contains no duplicates and all terms must match, you can
be sure that you are collecting docs only when the number of terms matches
the number of clauses in the query

roman


 Thanks for the input

 On Aug 9, 2013, at 6:56 AM, Yonik Seeley yo...@lucidworks.com wrote:

  *All* of the terms in the field must be matched by the query... not
 vice-versa.
  And no, we don't have a query for that out of the box.  To implement,
  it seems like it would require the total number of terms indexed for a
  field (for each document).
  I guess you could also index start and end tokens and then use query
  expansion to all possible combinations... messy though.
 
  -Yonik
  http://lucidworks.com
 
  On Fri, Aug 9, 2013 at 8:19 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  This _looks_ like simple phrase matching (no slop) and highlighting...
 
  But whenever I think the answer is really simple, it usually means
  that I'm missing something
 
  Best
  Erick
 
 
  On Thu, Aug 8, 2013 at 11:18 PM, Mark static.void@gmail.com
 wrote:
 
  Ok forget the mention of percolate.
 
  We have a large list of known keywords we would like to match against.
 
  Product keyword:  Sony
  Product keyword:  Samsung Galaxy
 
  We would like to be able to detect given a product title whether or
 not it
  matches any known keywords. For a keyword to be matched, all of its
 terms
  must be present in the product title given.
 
  Product Title: Sony Experia
   Matches and returns a highlight: <em>Sony</em> Experia
 
  Product Title: Samsung 52inch LC
  Does not match
 
  Product Title: Samsung Galaxy S4
   Matches and returns a highlight: <em>Samsung Galaxy</em>
 
  Product Title: Galaxy Samsung S4
   Matches and returns a highlight: <em>Galaxy Samsung</em>
 
  What would be the best way to approach this?
 
 
 
 
  On Aug 5, 2013, at 7:02 PM, Chris Hostetter hossman_luc...@fucit.org
  wrote:
 
 
  : Subject: Percolate feature?
 
  can you give a more concrete, realistic example of what you are
 trying to
  do? your synthetic hypothetical example is kind of hard to make sense
 of.
 
  your Subject line and comment that the percolate feature of elastic
   search sounds like what you want seems to have led some people down a
  path of assuming you want to run these types of queries as documents
 are
  indexed -- but that isn't at all clear to me from the way you worded
 your
  question other then that.
 
  it's also not clear what aspect of the results you really care
 about --
  are you only looking for the *number* of documents that match
 according
  to your concept of matching, or are you looking for a list of matches?
   what if multiple documents have all of their terms in the query string -- how
   should they score relative to each other?  what if a document contains
 the
  same term multiple times, do you expect it to be a match of a query
 only
  if that term appears in the query multiple times as well?  do you care
   about the ordering of the terms in the query? the ordering of the terms in
   the document?
 
   Ideally: describe for us what you want to do, w/o assuming
  solr/elasticsearch/anything specific about the implementation -- just
  describe your actual use case for us, with several real document/query
  examples.
 
 
 
  https://people.apache.org/~hossman/#xyproblem
  XY Problem
 
  Your question appears to be an XY Problem ... that is: you are
 dealing
  with X, you are assuming Y will help you, and you are asking about
  Y
  without giving more details about the X so that we can understand
 the
  full issue.  Perhaps the best solution doesn't involve Y at all?
  See Also: http://www.perlmonks.org/index.pl?node_id=542341
 
 
 
 
 
 
  -Hoss
 
 




Re: Percolate feature?

2013-08-09 Thread Roman Chyla
On Fri, Aug 9, 2013 at 2:56 PM, Chris Hostetter hossman_luc...@fucit.orgwrote:


 : I'll look into this. Thanks for the concrete example as I don't even
 : know which classes to start to look at to implement such a feature.

 Either roman isn't understanding what you are asking for, or i'm not --
 but i don't think what roman described will work for you...

 :  so if your query contains no duplicates and all terms must match, you
 can
 :  be sure that you are collecting docs only when the number of terms
 matches
 :  number of clauses in the query

 several of the examples you gave did not match what Roman is describing,
 as i understand it.  Most people on this thread seem to be getting
 confused by having their perceptions flipped about what your data known
 in advance is vs the data you get at request time.

 You described this...

 :  Product keyword:  Sony
 :  Product keyword:  Samsung Galaxy
 : 
 :  We would like to be able to detect given a product title whether or
 :  not it
 :  matches any known keywords. For a keyword to be matched all of it's
 :  terms
 :  must be present in the product title given.
 : 
 :  Product Title: Sony Experia
 :  Matches and returns a highlight: emSony/em Experia

 ...suggesting that what you call product keywords are the data you know
 about in advance and product titles are the data you get at request
 time.

 So your example of the request time input (ie: query) Sony Experia
 matching data known in advance (ie: indexed document) Sony would not
 work with Roman's example.

 To rephrase (what i think i understand is) your goal...

  * you have many (10^3+) documents known in advance
  * any document D contains a set of words W(D) of varying sizes
  * any request Q contains a set of words W(Q) of varying sizes
  * you want a given request Q to match a document D if and only if:
- W(D) is a subset of W(Q)


aha! this was not what i was understanding! i was assuming W(Q) is a subset
of W(D) - or rather, W(Q) === W(D)

so now i finally see the reasoning behind it and the use case, which is a
VERY interesting one.

roman



- ie: no item exists in W(D) that does not exist in W(Q)
- ie: any number of items may exist in W(Q) that are not in W(D)





 So to reiterate your examples from before, but change the labels a
 bit and add some more converse examples (and ignore the highlighting
 aspect for a moment)...

 doc1 = Sony
 doc2 = Samsung Galaxy
 doc3 = Sony Playstation

 queryA = Sony Experia   ... matches only doc1
 queryB = Sony Playstation 3 ... matches doc3 and doc1
 queryC = Samsung 52inch LC  ... doesn't match anything
 queryD = Samsung Galaxy S4  ... matches doc2
 queryE = Galaxy Samsung S4  ... matches doc2


 ...do i still have that correct?


 A similar question came up in the past, but i can't find my response now
 so i'll try to recreate it ...


 1) if you don't care about using non-trivial analysis (ie: you don't need
 stemming, or synonyms, etc..), you can do this with some
 really simple function queries -- assuming you index a field containing
 the number of words in each document, in addition to the words
 themselves.  Assuming your words are in a field named words and the
 number of words is in a field named words_count a request for something
 like Galaxy Samsung S4 can be represented as...

   q={!frange l=0 u=0}sub(words_count,
                          sum(termfreq('words','Galaxy'),
                              termfreq('words','Samsung'),
                              termfreq('words','S4')))

 ...ie: you want to compute the sum of the term frequencies for each of
 the words requested, and then you want to subtract that sum from the
 number of terms in the document -- and then you only want to match
 documents where the result of that subtraction is 0.

 one complexity that comes up, is that you haven't specified:

   * can the list of words in your documents contain duplicates?
   * can the list of words in your query contain duplicates?
   * should a document with duplicate words match only if the query also
 contains the same word duplicated?

 ...the answers to those questions make the math more complicated (and are
 left as an exercise for the reader)
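To make the shape of that request concrete, a small Python 2 sketch that assembles it for an incoming title - the words/words_count field names follow the example above, while the host, the /select handler and the whitespace tokenization are assumptions:

# Rough sketch: assemble the {!frange} query above for an incoming product title.
# Field names 'words'/'words_count' follow the example; the URL, handler and
# whitespace tokenization are assumptions.
import urllib

def build_frange_query(title):
    terms = title.split()  # naive tokenization; real analysis may differ
    freqs = ','.join("termfreq('words','%s')" % t for t in terms)
    return '{!frange l=0 u=0}sub(words_count,sum(%s))' % freqs

params = urllib.urlencode({'q': build_frange_query('Galaxy Samsung S4'), 'wt': 'json'})
print 'http://localhost:8983/solr/select?' + params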


 2) if you *do* care about using non-trivial analysis, then you can't use
 the simple termfreq() function, which deals with raw terms -- instead
 you have to use the query() function to ensure that the input is parsed
 appropriately -- but then you have to wrap that function in something that
 will normalize the scores - so in place of termfreq('words','Galaxy')
 you'd want something like...

 if(query({!field f=words v='Galaxy'}),1,0)

 ...but again the math gets much harder if you make things more complex
 with duplicate words in the document or duplicate words in the query --
 you'd
 probably have to use a custom similarity to get the scores returned by the
 query() function to be usable as is in the match equation (and drop the
 if() function)


 As for the 

Re: Measuring SOLR performance

2013-08-12 Thread Roman Chyla
Hi Dmitry,



On Mon, Aug 12, 2013 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 Good point. I managed to run the command with -C and double quotes:

 python solrjmeter.py -a -C g1,cms -c hour -x ./jmx/SolrQueryTest.jmx

 As a result got several files (html, css, js, csv) in the running directory
 (any way to specify where the output should be stored in this case?)


i know it is confusing, i plan to change it - but later, now it is too busy
here...



 When I look onto the comparison dashboard, I see this:

 http://pbrd.co/17IRI0b


two things: the tests probably took more than one hour to finish, so they
are not aligned - try generating the comparison with '-c  14400'  (ie.
4x3600 secs)

the other thing: if you have only two datapoints, the dygraph will not show
anything - there must be more datapoints/measurements




 One more thing: all the previous tests were run with softCommit disabled.
 After enabling it, the tests started to fail:

 $ python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
 ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60 -R
 g1 -t /solr/statements -e statements -U 100
 $ cd g1
 Reading results of the previous test
 $ cd 2013.08.12.16.32.48
 $ cd /home/dmitry/projects/lab/solrjmeter4/solrjmeter/g1
 $ mkdir 2013.08.12.16.33.02
 $ cd 2013.08.12.16.33.02
 $ cd /home/dmitry/projects/lab/solrjmeter4/solrjmeter/g1
 $ cd /home/dmitry/projects/lab/solrjmeter4/solrjmeter
 $ cd /home/dmitry/projects/lab/solrjmeter4/solrjmeter
 Traceback (most recent call last):
   File solrjmeter.py, line 1427, in module
 main(sys.argv)
   File solrjmeter.py, line 1381, in main
 before_test = harvest_details_about_montysolr(options)
   File solrjmeter.py, line 562, in harvest_details_about_montysolr
 indexLstModified = cores_data['status'][cn]['index']['lastModified'],
 KeyError: 'lastModified'


Thanks for letting me know, that info is probably not available in this
situation - i've cooked something quick to fix it, please try the latest commit
(hope it doesn't do more harm, i should get some sleep ..;))

roman



 In case it matters:  Python 2.7.3, ubuntu, solr 4.3.1.

 Thanks,

 Dmitry


 On Thu, Aug 8, 2013 at 2:22 AM, Roman Chyla roman.ch...@gmail.com wrote:

  Hi Dmitry,
  The command seems good. Are you sure your shell is not doing something
  funny with the params? You could try:
 
  python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx -a
 
  where g1 and foo are results of the individual runs, ie. something that
 was
  started and saved with '-R g1' and '-R foo' respectively
 
  so, for example, i have these comparisons inside
  '/var/lib/montysolr/different-java-settings/solrjmeter', so I am
 generating
  the comparison by:
 
  export
  SOLRJMETER_HOME=/var/lib/montysolr/different-java-settings/solrjmeter
  python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx -a
 
 
  roman
 
 
  On Wed, Aug 7, 2013 at 10:03 AM, Dmitry Kan solrexp...@gmail.com
 wrote:
 
   Hi Roman,
  
   One more question. I tried to compare different runs (g1 vs cms) using
  the
   command below, but get an error. Should I attach some other param(s)?
  
  
   python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx
   **ERROR**
 File solrjmeter.py, line 1427, in module
   main(sys.argv)
 File solrjmeter.py, line 1303, in main
   check_options(options, args)
 File solrjmeter.py, line 185, in check_options
   error(The folder '%s' does not exist % rf)
 File solrjmeter.py, line 66, in error
   traceback.print_stack()
   The folder '0' does not exist
  
   Dmitry
  
  
  
  
   On Wed, Aug 7, 2013 at 4:13 PM, Dmitry Kan solrexp...@gmail.com
 wrote:
  
Hi Roman,
   
Finally, this has worked! Thanks for quick support.
   
The graphs look awesome. At least on the index sample :) It is quite
  easy
 to set up and run + possible to run directly on the shard server in
background mode.
   
my test run was:
   
python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs
 60
   -R
foo -t /solr/statements -e statements
   
Thanks!
   
Dmitry
   
   
On Wed, Aug 7, 2013 at 6:54 AM, Roman Chyla roman.ch...@gmail.com
   wrote:
   
Hi Dmitry,
   
I've modified the solrjmeter to retrieve data from under the core
 (the
   -t
parameter) and the rest from the /solr/admin - I could test it only
against
4.0, but it is there the same as 4.3 - it seems...so you can try the
   fresh
checkout
   
my test was: python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -t
/solr/collection1 -R foo -q ./queries/demo/* -p 9002 -s adsate
   
Thanks!
   
roman
   
   
On Tue, Aug 6, 2013 at 9:46 AM, Dmitry Kan solrexp...@gmail.com
   wrote:
   
 Hi,

 Thanks for the clarification, Shawn!

 So with this in mind, the following work:

 http://localhost:8983/solr/statements/admin

Re: Measuring SOLR performance

2013-08-13 Thread Roman Chyla
Hi Dmitry, oh yes, late night fixes... :) The latest commit should make it
work for you.
Thanks!

roman


On Tue, Aug 13, 2013 at 3:37 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 Something bad happened in fresh checkout:

 python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
 ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60 -R
 cms -t /solr/statements -e statements -U 100

 Traceback (most recent call last):
   File solrjmeter.py, line 1392, in module
 main(sys.argv)
   File solrjmeter.py, line 1347, in main
 save_into_file('before-test.json', simplejson.dumps(before_test))
   File /usr/lib/python2.7/dist-packages/simplejson/__init__.py, line 286,
 in dumps
 return _default_encoder.encode(obj)
   File /usr/lib/python2.7/dist-packages/simplejson/encoder.py, line 226,
 in encode
 chunks = self.iterencode(o, _one_shot=True)
   File /usr/lib/python2.7/dist-packages/simplejson/encoder.py, line 296,
 in iterencode
 return _iterencode(o, 0)
   File /usr/lib/python2.7/dist-packages/simplejson/encoder.py, line 202,
 in default
 raise TypeError(repr(o) +  is not JSON serializable)
 TypeError: __main__.ForgivingValue object at 0x7fc6d4040fd0 is not JSON
 serializable


 Regards,

 D.


 On Tue, Aug 13, 2013 at 8:10 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi Dmitry,
 
 
 
  On Mon, Aug 12, 2013 at 9:36 AM, Dmitry Kan solrexp...@gmail.com
 wrote:
 
   Hi Roman,
  
   Good point. I managed to run the command with -C and double quotes:
  
   python solrjmeter.py -a -C g1,cms -c hour -x ./jmx/SolrQueryTest.jmx
  
   As a result got several files (html, css, js, csv) in the running
  directory
   (any way to specify where the output should be stored in this case?)
  
 
  i know it is confusing, i plan to change it - but later, now it is too
 busy
  here...
 
 
  
   When I look onto the comparison dashboard, I see this:
  
   http://pbrd.co/17IRI0b
  
 
  two things: the tests probably took more than one hour to finish, so they
  are not aligned - try generating the comparison with '-c  14400'  (ie.
  4x3600 secs)
 
  the other thing: if you have only two datapoints, the dygraph will not
 show
  anything - there must be more datapoints/measurements
 
 
 
  
   One more thing: all the previous tests were run with softCommit
 disabled.
   After enabling it, the tests started to fail:
  
   $ python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
   ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60
  -R
   g1 -t /solr/statements -e statements -U 100
   $ cd g1
   Reading results of the previous test
   $ cd 2013.08.12.16.32.48
   $ cd /home/dmitry/projects/lab/solrjmeter4/solrjmeter/g1
   $ mkdir 2013.08.12.16.33.02
   $ cd 2013.08.12.16.33.02
   $ cd /home/dmitry/projects/lab/solrjmeter4/solrjmeter/g1
   $ cd /home/dmitry/projects/lab/solrjmeter4/solrjmeter
   $ cd /home/dmitry/projects/lab/solrjmeter4/solrjmeter
   Traceback (most recent call last):
 File solrjmeter.py, line 1427, in module
   main(sys.argv)
 File solrjmeter.py, line 1381, in main
   before_test = harvest_details_about_montysolr(options)
 File solrjmeter.py, line 562, in harvest_details_about_montysolr
   indexLstModified =
 cores_data['status'][cn]['index']['lastModified'],
   KeyError: 'lastModified'
  
 
  Thanks for letting me know, that info is probably not available in this
   situation - i've cooked something quick to fix it, please try the latest commit
  (hope it doesn't do more harm, i should get some sleep ..;))
 
  roman
 
 
  
   In case it matters:  Python 2.7.3, ubuntu, solr 4.3.1.
  
   Thanks,
  
   Dmitry
  
  
   On Thu, Aug 8, 2013 at 2:22 AM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
Hi Dmitry,
The command seems good. Are you sure your shell is not doing
 something
funny with the params? You could try:
   
python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx
 -a
   
where g1 and foo are results of the individual runs, ie. something
 that
   was
started and saved with '-R g1' and '-R foo' respectively
   
so, for example, i have these comparisons inside
'/var/lib/montysolr/different-java-settings/solrjmeter', so I am
   generating
the comparison by:
   
export
SOLRJMETER_HOME=/var/lib/montysolr/different-java-settings/solrjmeter
python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx
 -a
   
   
roman
   
   
On Wed, Aug 7, 2013 at 10:03 AM, Dmitry Kan solrexp...@gmail.com
   wrote:
   
 Hi Roman,

 One more question. I tried to compare different runs (g1 vs cms)
  using
the
 command below, but get an error. Should I attach some other
 param(s)?


 python solrjmeter.py -C g1,foo -c hour -x ./jmx/SolrQueryTest.jmx
 **ERROR**
   File solrjmeter.py, line 1427, in module
 main(sys.argv)
   File solrjmeter.py, line 1303, in main
 check_options(options, args)
   File solrjmeter.py, line 185

Re: Measuring SOLR performance

2013-08-22 Thread Roman Chyla
Hi Dmitry,
So it seems solrjmeter should not assume the adminPath - perhaps it needs
to be passed as an argument. When you set the adminPath, are you able to
access localhost:8983/solr/statements/admin/cores ?

roman


On Wed, Aug 21, 2013 at 7:36 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 I have noticed a difference with different solr.xml config contents. It is
 probably legit, but thought to let you know (tests run on fresh checkout as
 of today).

 As mentioned before, I have two cores configured in solr.xml. If the file
 is:

 [code]
 solr persistent=false

   !--
   adminPath: RequestHandler path to manage cores.
 If 'null' (or absent), cores will not be manageable via request handler
   --
   cores adminPath=/admin/cores host=${host:}
 hostPort=${jetty.port:8983} hostContext=${hostContext:solr}
 core name=metadata instanceDir=metadata /
 core name=statements instanceDir=statements /
   /cores
 /solr
 [/code]

 then the instruction:

 python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
 ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60 -R
 cms -t /solr/statements -e statements -U 100

 works just fine. If however the solr.xml has adminPath set to /admin
 solrjmeter produces an error:

 [error]
 **ERROR**
   File solrjmeter.py, line 1386, in module
 main(sys.argv)
   File solrjmeter.py, line 1278, in main
 check_prerequisities(options)
   File solrjmeter.py, line 375, in check_prerequisities
 error('Cannot find admin pages: %s, please report a bug' % apath)
   File solrjmeter.py, line 66, in error
 traceback.print_stack()
 Cannot find admin pages: http://localhost:8983/solr/admin, please report a
 bug
 [/error]

 With both solr.xml configs the following url returns just fine:

 http://localhost:8983/solr/statements/admin/system?wt=json

 Regards,

 Dmitry



 On Wed, Aug 14, 2013 at 2:03 PM, Dmitry Kan solrexp...@gmail.com wrote:

  Hi Roman,
 
   This looks much better, thanks! The ordinary non-comparison mode works.
  I'll post here, if there are other findings.
 
  Thanks for quick turnarounds,
 
  Dmitry
 
 
  On Wed, Aug 14, 2013 at 1:32 AM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
  Hi Dmitry, oh yes, late night fixes... :) The latest commit should make
 it
  work for you.
  Thanks!
 
  roman
 
 
  On Tue, Aug 13, 2013 at 3:37 AM, Dmitry Kan solrexp...@gmail.com
 wrote:
 
   Hi Roman,
  
   Something bad happened in fresh checkout:
  
   python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
   ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs
 60
  -R
   cms -t /solr/statements -e statements -U 100
  
   Traceback (most recent call last):
 File solrjmeter.py, line 1392, in module
   main(sys.argv)
 File solrjmeter.py, line 1347, in main
   save_into_file('before-test.json', simplejson.dumps(before_test))
 File /usr/lib/python2.7/dist-packages/simplejson/__init__.py, line
  286,
   in dumps
   return _default_encoder.encode(obj)
 File /usr/lib/python2.7/dist-packages/simplejson/encoder.py, line
  226,
   in encode
   chunks = self.iterencode(o, _one_shot=True)
 File /usr/lib/python2.7/dist-packages/simplejson/encoder.py, line
  296,
   in iterencode
   return _iterencode(o, 0)
 File /usr/lib/python2.7/dist-packages/simplejson/encoder.py, line
  202,
   in default
   raise TypeError(repr(o) +  is not JSON serializable)
   TypeError: __main__.ForgivingValue object at 0x7fc6d4040fd0 is not
  JSON
   serializable
  
  
   Regards,
  
   D.
  
  
   On Tue, Aug 13, 2013 at 8:10 AM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Hi Dmitry,
   
   
   
On Mon, Aug 12, 2013 at 9:36 AM, Dmitry Kan solrexp...@gmail.com
   wrote:
   
 Hi Roman,

 Good point. I managed to run the command with -C and double
 quotes:

 python solrjmeter.py -a -C g1,cms -c hour -x
  ./jmx/SolrQueryTest.jmx

 As a result got several files (html, css, js, csv) in the running
directory
 (any way to specify where the output should be stored in this
 case?)

   
i know it is confusing, i plan to change it - but later, now it is
 too
   busy
here...
   
   

 When I look onto the comparison dashboard, I see this:

 http://pbrd.co/17IRI0b

   
two things: the tests probably took more than one hour to finish, so
  they
are not aligned - try generating the comparison with '-c  14400'
  (ie.
4x3600 secs)
   
the other thing: if you have only two datapoints, the dygraph will
 not
   show
anything - there must be more datapoints/measurements
   
   
   

 One more thing: all the previous tests were run with softCommit
   disabled.
 After enabling it, the tests started to fail:

 $ python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
 ./queries/demo/demo.queries -s localhost -p 8983 -a
  --durationInSecs 60
-R
 g1 -t /solr/statements -e statements -U 100
 $ cd g1
 Reading

Re: Measuring SOLR performance

2013-09-02 Thread Roman Chyla
Hi Dmitry,

If it is something you want to pass with every request (which is my use
case), you can pass it as additional solr params, eg.

python solrjmeter
--additionalSolrParams=fq=other_field:bar+facet=true+facet.field=facet_field_name


the string should be url encoded.
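For example, a small Python 2 snippet can produce an encoded value - the '+'-separated layout simply mirrors the example above, so treat that separator as an assumption:

# Sketch: percent-encode parameter values before handing them to --additionalSolrParams.
# The '+'-separated layout follows the example above; treat it as an assumption.
import urllib

params = [('fq', 'other_field:bar'), ('facet', 'true'), ('facet.field', 'facet_field_name')]
encoded = '+'.join('%s=%s' % (k, urllib.quote(v, safe='')) for k, v in params)
print '--additionalSolrParams=%s' % encoded
# -> --additionalSolrParams=fq=other_field%3Abar+facet=true+facet.field=facet_field_name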

If it is something that changes with every request, you should modify the
jmeter test. If you open/load it with jmeter GUI, in the HTTP request
processor you can define other additional fields to pass with the request.
These values can come from the CSV file; you'll see an example of how to use
that when you open the test definition file.

Cheers,

  roman




On Mon, Sep 2, 2013 at 3:12 PM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Erick,

 Agreed, it is perfectly fine to mix them in solr. But my question is about
 the solrjmeter input query format. I just couldn't find a suitable example in
 solrjmeter's github.

 Dmitry



 On Mon, Sep 2, 2013 at 5:40 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  filter and facet queries can be freely intermixed, it's not a problem.
  What problem are you seeing when you try this?
 
  Best,
  Erick
 
 
  On Mon, Sep 2, 2013 at 7:46 AM, Dmitry Kan solrexp...@gmail.com wrote:
 
   Hi Roman,
  
   What's the format for running the facet+filter queries?
  
   Would something like this work:
  
   field:foo  =50  fq=other_field:bar facet=true
  facet.field=facet_field_name
  
  
   Thanks,
   Dmitry
  
  
  
   On Fri, Aug 23, 2013 at 2:34 PM, Dmitry Kan solrexp...@gmail.com
  wrote:
  
Hi Roman,
   
With adminPath=/admin or adminPath=/admin/cores, no.
 Interestingly
enough, though, I can access
http://localhost:8983/solr/statements/admin/system
   
But I can access http://localhost:8983/solr/admin/cores, only when
  with
adminPath=/admin/cores (which suggests that this is the right value
  to
   be
used for cores), and not with adminPath=/admin.
   
Bottom line, these core configuration is not self-evident.
   
Dmitry
   
   
   
   
On Fri, Aug 23, 2013 at 4:18 AM, Roman Chyla roman.ch...@gmail.com
   wrote:
   
Hi Dmitry,
So it seems solrjmeter should not assume the adminPath - and perhaps
   needs
to be passed as an argument. When you set the adminPath, are you
 able
  to
access localhost:8983/solr/statements/admin/cores ?
   
roman
   
   
On Wed, Aug 21, 2013 at 7:36 AM, Dmitry Kan solrexp...@gmail.com
   wrote:
   
 Hi Roman,

 I have noticed a difference with different solr.xml config
 contents.
   It
is
 probably legit, but thought to let you know (tests run on fresh
checkout as
 of today).

 As mentioned before, I have two cores configured in solr.xml. If
 the
file
 is:

 [code]
 <solr persistent="false">

   <!--
   adminPath: RequestHandler path to manage cores.
     If 'null' (or absent), cores will not be manageable via request handler
   -->
   <cores adminPath="/admin/cores" host="${host:}"
          hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}">
     <core name="metadata" instanceDir="metadata" />
     <core name="statements" instanceDir="statements" />
   </cores>
 </solr>
 [/code]

 then the instruction:

 python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q
 ./queries/demo/demo.queries -s localhost -p 8983 -a
 --durationInSecs
   60
-R
 cms -t /solr/statements -e statements -U 100

 works just fine. If however the solr.xml has adminPath set to
  /admin
 solrjmeter produces an error:

 [error]
 **ERROR**
   File solrjmeter.py, line 1386, in module
 main(sys.argv)
   File solrjmeter.py, line 1278, in main
 check_prerequisities(options)
   File solrjmeter.py, line 375, in check_prerequisities
 error('Cannot find admin pages: %s, please report a bug' %
  apath)
   File solrjmeter.py, line 66, in error
 traceback.print_stack()
 Cannot find admin pages: http://localhost:8983/solr/admin, please
report a
 bug
 [/error]

 With both solr.xml configs the following url returns just fine:

 http://localhost:8983/solr/statements/admin/system?wt=json

 Regards,

 Dmitry



 On Wed, Aug 14, 2013 at 2:03 PM, Dmitry Kan solrexp...@gmail.com
 
wrote:

  Hi Roman,
 
  This looks much better, thanks! The ordinary non-comarison mode
   works.
  I'll post here, if there are other findings.
 
  Thanks for quick turnarounds,
 
  Dmitry
 
 
  On Wed, Aug 14, 2013 at 1:32 AM, Roman Chyla 
  roman.ch...@gmail.com
 wrote:
 
  Hi Dmitry, oh yes, late night fixes... :) The latest commit
  should
make
 it
  work for you.
  Thanks!
 
  roman
 
 
  On Tue, Aug 13, 2013 at 3:37 AM, Dmitry Kan 
  solrexp...@gmail.com
 wrote:
 
   Hi Roman,
  
   Something bad happened in fresh

Re: Measuring SOLR performance

2013-09-03 Thread Roman Chyla
Hi Dmitry,

Thanks for the feedback. Yes, it is indeed a jmeter issue (or rather, an
issue of the plugin we use to generate charts). You may want to use the
GitHub issue tracker for whatever comes next:

https://github.com/romanchyla/solrjmeter/issues

Cheers,

  roman


On Tue, Sep 3, 2013 at 7:54 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 Thanks, the --additionalSolrParams was just what I wanted and works fine.

 BTW, if you have some special bug tracking forum for the tool, I'm happy
 to submit questions / bug reports there. Otherwise, this email list is ok
 (for me at least).

 One other thing I have noticed in the err logs was a series of messages of
 this sort upon generating the perf test report. Seems to be jmeter related
 (the err messages disappear, if extra lib dir is present under ext
 directory).

 java.lang.Throwable: Could not access
 /home/dmitry/projects/lab/solrjmeter7/solrjmeter/jmeter/lib/ext/lib
 at
 kg.apc.cmd.UniversalRunner.buildUpdatedClassPath(UniversalRunner.java:109)
 at kg.apc.cmd.UniversalRunner.clinit(UniversalRunner.java:55)
 at
 kg.apc.cmd.UniversalRunner.buildUpdatedClassPath(UniversalRunner.java:109)
 at kg.apc.cmd.UniversalRunner.clinit(UniversalRunner.java:55)

 at
 kg.apc.cmd.UniversalRunner.buildUpdatedClassPath(UniversalRunner.java:109)
 at kg.apc.cmd.UniversalRunner.clinit(UniversalRunner.java:55)



 On Tue, Sep 3, 2013 at 2:50 AM, Roman Chyla roman.ch...@gmail.com wrote:

  Hi Dmitry,
 
  If it is something you want to pass with every request (which is my use
  case), you can pass it as additional solr params, eg.
 
  python solrjmeter
 
 
 --additionalSolrParams=fq=other_field:bar+facet=true+facet.field=facet_field_name
  
 
  the string should be url encoded.
 
  If it is something that changes with every request, you should modify the
  jmeter test. If you open/load it with jmeter GUI, in the HTTP request
  processor you can define other additional fields to pass with the
 request.
  These values can come from the CSV file; you'll see an example of how to use
  that when you open the test definition file.
 
  Cheers,
 
roman
 
 
 
 
  On Mon, Sep 2, 2013 at 3:12 PM, Dmitry Kan solrexp...@gmail.com wrote:
 
   Hi Erick,
  
   Agree, this is perfectly fine to mix them in solr. But my question is
  about
   solrjmeter input query format. Just couldn't find a suitable example on
  the
   solrjmeter's github.
  
   Dmitry
  
  
  
   On Mon, Sep 2, 2013 at 5:40 PM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
filter and facet queries can be freely intermixed, it's not a
 problem.
What problem are you seeing when you try this?
   
Best,
Erick
   
   
On Mon, Sep 2, 2013 at 7:46 AM, Dmitry Kan solrexp...@gmail.com
  wrote:
   
 Hi Roman,

 What's the format for running the facet+filter queries?

 Would something like this work:

 field:foo  =50  fq=other_field:bar facet=true
facet.field=facet_field_name


 Thanks,
 Dmitry



 On Fri, Aug 23, 2013 at 2:34 PM, Dmitry Kan solrexp...@gmail.com
wrote:

  Hi Roman,
 
  With adminPath=/admin or adminPath=/admin/cores, no.
   Interestingly
  enough, though, I can access
  http://localhost:8983/solr/statements/admin/system
 
  But I can access http://localhost:8983/solr/admin/cores, only
 when
with
  adminPath=/admin/cores (which suggests that this is the right
  value
to
 be
  used for cores), and not with adminPath=/admin.
 
  Bottom line, these core configuration is not self-evident.
 
  Dmitry
 
 
 
 
  On Fri, Aug 23, 2013 at 4:18 AM, Roman Chyla 
  roman.ch...@gmail.com
 wrote:
 
  Hi Dmitry,
  So it seems solrjmeter should not assume the adminPath - and
  perhaps
 needs
  to be passed as an argument. When you set the adminPath, are you
   able
to
  access localhost:8983/solr/statements/admin/cores ?
 
  roman
 
 
  On Wed, Aug 21, 2013 at 7:36 AM, Dmitry Kan 
 solrexp...@gmail.com
  
 wrote:
 
   Hi Roman,
  
   I have noticed a difference with different solr.xml config
   contents.
 It
  is
   probably legit, but thought to let you know (tests run on
 fresh
  checkout as
   of today).
  
   As mentioned before, I have two cores configured in solr.xml.
 If
   the
  file
   is:
  
   [code]
   <solr persistent="false">

     <!--
     adminPath: RequestHandler path to manage cores.
       If 'null' (or absent), cores will not be manageable via request handler
     -->
     <cores adminPath="/admin/cores" host="${host:}"
            hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}">
       <core name="metadata" instanceDir="metadata" />
       <core name="statements" instanceDir="statements" />
     </cores>
   </solr>
   [/code

Re: Dynamic Query Analyzer

2013-09-03 Thread Roman Chyla
You don't need to index the field several times; you can index it just into
one field and use the different query analyzers only to build the query.
We're doing this for authors, for example - if the query language says
=author:einstein, the query parser knows this field should be analyzed
differently (that is part of your application logic, of your query
language semantics - so it can vary).

The parser will change 'author' to 'nosynonym_author', which means the
'nosynonym_author' analyzer is used for the analysis phase, and after the
query has been built, we 'simply' change the query field from
'nosynonym_author' back to 'author'. Seems complex, but it is actually easy.
But it depends on which query parser you can/want to use. I use this:
https://issues.apache.org/jira/browse/LUCENE-5014
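To make the renaming step concrete, here is a rough sketch (not the actual code from
LUCENE-5014, just an illustration) of rewriting the field name on an already-built Lucene
query - only TermQuery and BooleanQuery are handled; a real implementation would also cover
phrase, span, etc.:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FieldRenamer {
  /** Returns a copy of q with field 'from' (e.g. "nosynonym_author") renamed to 'to' (e.g. "author"). */
  public static Query rename(Query q, String from, String to) {
    if (q instanceof TermQuery) {
      Term t = ((TermQuery) q).getTerm();
      if (t.field().equals(from)) {
        TermQuery renamed = new TermQuery(new Term(to, t.text()));
        renamed.setBoost(q.getBoost());
        return renamed;
      }
      return q;
    }
    if (q instanceof BooleanQuery) {
      BooleanQuery out = new BooleanQuery();
      out.setBoost(q.getBoost());
      for (BooleanClause c : ((BooleanQuery) q).clauses()) {
        out.add(rename(c.getQuery(), from, to), c.getOccur());
      }
      return out;
    }
    return q; // phrase/span/etc. queries would need the same treatment
  }
}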

roman




On Tue, Sep 3, 2013 at 11:41 AM, Daniel Rosher rosh...@gmail.com wrote:

 Hi,

 We have a need to specify a different query analyzer depending on input
 parameters dynamically.

 We need this so that we can use different stopword lists at query time.

 Would any one know how I might be able to achieve this in solr?

 I'm aware of the solution to specify different field types, each with a
 different query analyzer, but I'd like not to have to index the field
 multiple times.

 Many thanks
 Dab



Web App Engineer at Harvard-Smithsonian Astrophysical Observatory, full time, indefinite contract

2013-10-07 Thread Roman Chyla
Dear all,

We are looking for a new member to join our team. This position requires
solid knowledge of Python, plus experience with web development, HTML5,
XSLT, JSON, CSS3, relational databases and NoSQL, but search (and SOLR) is
the central point of everything we do here. So, if you love SOLR/Lucene as
we do, then I'm sure there will be plenty of opportunities for search-related
development for you too.


About the project:

http://labs.adsabs.harvard.edu/adsabs/

The ADS is the central discovery engine for astronomical information, used
nearly every day by nearly every astronomer. Conceived 20 years ago, moving
into its third decade, the ADS continues to serve the research community
worldwide.

The ADS is currently developing the next-generation web-based platform
supporting current and future services. The project is committed to
developing and using open-source software. The main components of the
system architecture are: Apache SOLR/Lucene (search), CERN Invenio and
MongoDB (storage), Python+Flask+Bootstrap (frontend).

We are looking for a highly-motivated full-stack developer interested in
joining a dynamic team of talented individuals architecting and
implementing the new platform. Your primary responsibility is the design,
development, and support of the ADS front-end applications (including the
new search interface) as well as the implementation of the user database,
login system and personalization features.

For more information, please see the full posting online at:
http://www.cfa.harvard.edu/hr/postings/13-32.html


Thank you,

  Roman

--

Dr. Roman Chyla
ADS, Harvard-Smithsonian Center for Astrophysics
roman.ch...@gmail.com


Re: Solr's Filtering approaches

2013-10-12 Thread Roman Chyla
David,
We have a similar query in astrophysics: a user can select an area of the
sky... many stars out there.

I am long overdue in creating a Jira issue, but here you have another
efficient mechanism for searching a large number of ids:

https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
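The heart of it is just a DelegatingCollector that lets a document through only if its
global docid is set in a precomputed bitset; a simplified sketch (not the class linked
above - how the bitset gets built and passed in is left out):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.util.FixedBitSet;
import org.apache.solr.search.DelegatingCollector;

public class BitSetCollector extends DelegatingCollector {
  private final FixedBitSet allowed; // bit i set <=> global lucene docid i may be returned
  private int base;                  // docid offset of the segment being collected

  public BitSetCollector(FixedBitSet allowed) {
    this.allowed = allowed;
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    this.base = context.docBase;
    super.setNextReader(context);
  }

  @Override
  public void collect(int doc) throws IOException {
    if (allowed.get(base + doc)) {
      super.collect(doc); // pass the doc on to the normal result/facet collectors
    }
  }
}

Such a collector is what a query implementing PostFilter returns from getFilterCollector(),
with cache=false and a cost >= 100, so Solr runs it after the cheaper filters.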

Roman
On 12 Oct 2013 01:57, David Philip davidphilipshe...@gmail.com wrote:

 Groups are pharmaceutical research expts.. User is presented with graph
 view, he can select some region and all the groups in that region gets
 included..user can modify the groups also here.. so we didn't maintain
 group information in same solr index but we have externalized.
 I looked at post filter article. So my understanding is that, I simply have
 to extended as you did and should include implementaton for
 isAllowed(acls[doc], groups) .This will filter the documents in the
 collector and finally this collector will be returned. am I right?

   @Override
   public void collect(int doc) throws IOException {
 if (isAllowed(acls[doc], user, groups)) super.collect(doc);
   }


 Erick, I am interested to know whether I can extend any class that can
 return me only the bitset of the documents that match the search query. I
 can then do bitset1.andbitset2OfGroups - finally, collect only those
 documents to return to user. How do I try this approach? Any pointers for
 bit set?

 Thanks - David




 On Thu, Oct 10, 2013 at 5:25 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Well, my first question is why 50K groups is necessary, and
  whether you can simplify that. How a user can manually
  choose from among that many groups is interesting. But
  assuming they're all necessary, I can think of two things.
 
  If the user can only select ranges, just put in filter queries
  using ranges. Or possibly both ranges and individual entries,
  as fq=group:[1A TO 1A] OR group:(2B 45C 98Z) etc.
  You need to be a little careful how you put index these so
  range queries work properly, in the above you'd miss
  2A because it's sorting lexicographically, you'd need to
  store in some form that sorts like 001A 01A
  and so on. You wouldn't need to show that form to the
  user, just form your fq's in the app to work with
  that form.
 
  If that won't work (you wouldn't want this to get huge), think
  about a post filter that would only operate on documents that
  had made it through the select, although how to convey which
  groups the user selected to the post filter is an open
  question.
 
  Best,
  Erick
 
  On Wed, Oct 9, 2013 at 12:23 PM, David Philip
  davidphilipshe...@gmail.com wrote:
   Hi All,
  
   I have an issue in handling filters for one of our requirements and
   liked to get suggestion  for the best approaches.
  
  
   *Use Case:*
  
   1.  We have List of groups and the number of groups can increase upto
 1
   million. Currently we have almost 90 thousand groups in the solr search
   system.
  
   2.  Just before the user hits a search, He has options to select the
 no.
  of
groups he want to retrieve. [the distinct list of these group Names
 for
   display are retrieved from other solr index that has more information
  about
   groups]
  
   *3.User Operation:** *
   Say if user selected group 1A  - group 1A.  and searches for
  key:cancer.
  
  
   The current approach I was thinking is : get search results and filter
   query by groupids' list selected by user. But my concern is When these
   groups list is increasing to 50k unique Ids, This can cause lot of
 delay
   in getting search results. So wanted to know whether there are
 different
filtering ways that I can try for?
  
   I was thinking of one more approach as suggested by my colleague to do
 -
intersection.  -
   Get the groupIds' selected by user.
   Get the list of groupId's from search results,
   Perform intersection of both and then get the entire result set of only
   those groupid that intersected. Is this better way? Can I use any cache
   technique in this case?
  
  
   - David.
 



Re: Complex Queries in solr

2013-10-20 Thread Roman Chyla
i just tested whether our 'beautiful' parser supports it, and funnily
enough, it does :-)
https://github.com/romanchyla/montysolr/commit/f88577345c6d3a2dbefc0161f6bb07a549bc6b15

but i've (kinda) given up hope that people need powerful query parsers in
the lucene world - LUCENE-5014 has been sitting there without attention for
eons ... so your best bet is probably to use the LucidWorks parser (but i
don't know if it supports proximity searches combined with wildcards and
booleans), or you can create a custom query plugin that builds the following
query:

spanNear([spanOr([SpanMultiTermQueryWrapper(all:consult*),
SpanMultiTermQueryWrapper(all:advis*)]), spanOr([all:fee,
all:retainer, all:salary, all:bonus])], 4, true)
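in plain Lucene API terms that is roughly the following (field name 'all' and slop 4 are
taken from the toString above; for NEAR(40) the slop would be 40):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class ProximityQueryExample {
  public static SpanNearQuery build() {
    // (consult* OR advis*) as a span query: wrap the prefix queries
    SpanQuery left = new SpanOrQuery(
        new SpanMultiTermQueryWrapper<PrefixQuery>(new PrefixQuery(new Term("all", "consult"))),
        new SpanMultiTermQueryWrapper<PrefixQuery>(new PrefixQuery(new Term("all", "advis"))));
    // (fee OR retainer OR salary OR bonus)
    SpanQuery right = new SpanOrQuery(
        new SpanTermQuery(new Term("all", "fee")),
        new SpanTermQuery(new Term("all", "retainer")),
        new SpanTermQuery(new Term("all", "salary")),
        new SpanTermQuery(new Term("all", "bonus")));
    // within 4 positions, in order
    return new SpanNearQuery(new SpanQuery[]{left, right}, 4, true);
  }
}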

you can draw inspiration from here:
https://github.com/romanchyla/montysolr/blob/master/contrib/antlrqueryparser/src/java/org/apache/lucene/queryparser/flexible/aqp/builders/AqpNearQueryNodeBuilder.java
https://github.com/romanchyla/montysolr/blob/master/contrib/antlrqueryparser/src/java/org/apache/lucene/queryparser/flexible/aqp/builders/SpanConverter.java


but i think you are aware that such a query is NOT going to be very
efficient, especially when proximity=40 ;-)

hth

roman



On Fri, Oct 18, 2013 at 3:28 AM, sayeed abdulsayeed...@gmail.com wrote:

 Hi,
 Is it possible to search complex queries like
 (consult* or advis*) NEAR(40) (fee or retainer or salary or bonus)
 in solr




 -
 Sayeed
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Complex-Queries-in-solr-tp4096288.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Compound words

2013-10-28 Thread Roman Chyla
Hi Parvesh,
I think you should check the following jira
https://issues.apache.org/jira/browse/SOLR-5379. You will find links there
to other possible solutions/problems :-)
Roman
On 28 Oct 2013 09:06, Erick Erickson erickerick...@gmail.com wrote:

 Consider setting expand=true at index time. That
 puts all the tokens in your index, and then you
 may not need to have any synonym
 processing at query time since all the variants will
 already be in the index.

 As it is, you've replaced the words in the original with
 synonyms, essentially collapsed them down to a single
 word and then you have to do something at query time
 to get matches. If all the variants are in the index, you
 shouldn't have to. That's what I meant by raw.

 Best,
 Erick


 On Mon, Oct 28, 2013 at 8:02 AM, Parvesh Garg parv...@zettata.com wrote:

  Hi Erick,
 
  Thanks for the suggestion. Like I said, I'm an infant.
 
  We tried synonyms both ways. sea biscuit = seabiscuit and seabiscuit =
  sea biscuit and didn't understand exactly how it worked. But I just
 checked
  the analysis tool, and it seems to work perfectly fine at index time.
 Now,
  I can happily discard my own filter and 4 days of work. I'm happy I got
 to
  know a few ways on how/when not to write a solr filter :)
 
  I tried the string sea biscuit sea bird with expand=false and the
 tokens
  i got were seabiscuit sea bird at 1,2 and 3 positions respectively. But
 at
  query time, when I enter the same term sea biscuit sea bird, using
  edismax and qf, pf2, and pf3, the parsedQuery looks like this:
 
  +((text:sea) (text:biscuit) (text:sea) (text:bird)) ((text:\biscuit
 sea\)
  (text:\sea bird\)) ((text:\seabiscuit sea\) (text:\biscuit sea
  bird\))
 
  What I wanted instead was this
 
  +((text:seabiscuit) (text:sea) (text:bird)) ((text:\seabiscuit sea\)
  (text:\sea bird\)) (text:\seabiscuit sea bird\)
 
  Looks like there isn't any other way than to pre-process query myself and
  create the compound word. What do you mean by just query the raw
 string?
  Am I still missing something?
 
  Parvesh Garg
  http://www.zettata.com
  (This time I did remove my phone number :) )
 
  On Mon, Oct 28, 2013 at 4:14 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   Why did you reject using synonyms? You can have multi-word
   synonyms just fine at index time, and at query time, since the
   multiple words are already substituted in the index you don't
   need to do the same substitution, just query the raw strings.
  
   I freely acknowledge you may have very good reasons for doing
   this yourself, I'm just making sure you know what's already
   there.
  
   See:
  
  
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
  
   Look particularly at the explanations for sea biscuit in that
 section.
  
   Best,
   Erick
  
  
  
   On Mon, Oct 28, 2013 at 3:47 AM, Parvesh Garg parv...@zettata.com
  wrote:
  
One more thing, Is there a way to remove my accidentally sent phone
   number
in the signature from the previous mail? aarrrggghhh
   
  
 



Re: Recherche avec et sans espaces

2013-11-04 Thread Roman Chyla
Hi Antoine,
I'll permit myself to respond in English, because my written French is
slower ;-)
Your problem is well known amongst Solr users: the query parser splits
tokens on whitespace, so the analyzer never sees the input 'la redoutte';
it receives 'la' 'redoutte'. You can of course enclose your search in quotes
like "la redoutte", but it is hard to force your users to do the same... I
have solved this and related problems for our astrophysics system by
writing a better query parser that searches both for individual tokens
and for phrases, so essentially the parser decides when to join tokens
together - and this also takes care of multi-token synonyms, because
synonym recognition is a related issue: it happens in the analysis phase,
and that comes after parsing. The code is there in LUCENE-5014 and I'll
perhaps make it available as a simple jar that you can drop inside solr,
but it is impossible to do soon, it is too busy here... But I hope the
explanation will help you search for a solution: you need to make sure that
your analysis chain sees 'la redoutte' and then, because you are using the
whitespace tokenizer, you need to define the synonym laredoutte,la\
redoutte

Hth

Roman
On 4 Nov 2013 11:48, Antoine REBOUL antoine.reb...@gmail.com wrote:

 Hello,

 I would like searches in a text-type field to return results even if the
 spaces are entered incorrectly
 (for example: la redoute=laredoute).

 Today my text field is defined as follows:


 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ISOLatin1AccentFilterFactory"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords.txt"
             enablePositionIncrements="true"
     />
     <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms2.txt"
             ignoreCase="true" expand="false"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             splitOnCaseChange="1"
             splitOnNumerics="1"
             preserveOriginal="1"
     />
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <filter class="solr.ISOLatin1AccentFilterFactory"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="0"
             catenateAll="1"
             splitOnCaseChange="1"
             preserveOriginal="1"
     />
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords.txt"
             enablePositionIncrements="true"
     />
     <filter class="solr.ElisionFilterFactory" articles="elisions.txt"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>






 Thank you in advance for any replies.
 Regards.

 Antoine Reboul
 *



Inconsistent number of hits returned by two solr instances (from the same index!)

2013-11-06 Thread Roman Chyla
Hello,

We have two solr searchers/instances (read-only). They read the same index,
but they did not return the same #hits for a particular query

Log is below, but to summarize: first server always returns 576 hits, the
second server returns: 440, 440, 576, 576...

These are just a few seconds apart. The load balancer directed requests to both
servers. Both servers report the same numHits for other queries. I checked
that nothing re-opened the index and there was no error... this is SOLR 4.0 (we
should update, I know), running CentOS; the index lives on a RAID5 mounted
volume and both instances just read it (the index wasn't updated while these
searches happened).

Does anybody have a pointer? I can't really understand it. Can it be a bug?

Thanks,

  roman



If you look at the log below, you will see the 9002 instance always returns 576
hits, but the 9003 instance is returning 440, 440, 65, 576


-bash-4.1$ grep -a -e 'Jones,+Christine+year:1990-2100'
./perpetuum/live-9002/solr-logging-0.log | grep -m 5 '2013-11-06 13:'
2013-11-06 13:10:47 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=400q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=155
2013-11-06 13:17:05 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=trueq=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=89
2013-11-06 13:17:06 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=200q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=87
2013-11-06 13:21:50 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=200q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=86
2013-11-06 13:21:51 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=400q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=87


-bash-4.1$ grep -a -e 'Jones,+Christine+year:1990-2100'
./perpetuum/live-9003/solr-logging-0.log | grep -m 5 '2013-11-06 13:'
2013-11-06 13:10:46 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=trueq=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=440 status=0 QTime=144
2013-11-06 13:10:46 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=200q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=440 status=0 QTime=78
2013-11-06 13:10:48 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select

Re: Inconsistent number of hits returned by two solr instances (from the same index!)

2013-11-06 Thread Roman Chyla
No, and I should add that this query was not against shards, just one
single index (and we don't use timeouts).

--roman


On Wed, Nov 6, 2013 at 5:28 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Does the header in the response indicate you're getting partialResults?

 http://help.websolr.com/kb/common-problems/why-am-i-getting-partial-results

 Michael Della Bitta

 Applications Developer

 o: +1 646 532 3062  | c: +1 917 477 7906

 appinions inc.

 “The Science of Influence Marketing”

 18 East 41st Street

 New York, NY 10017

 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 
 w: appinions.com http://www.appinions.com/


 On Wed, Nov 6, 2013 at 4:23 PM, Roman Chyla roman.ch...@gmail.com wrote:

  Hello,
 
  We have two solr searchers/instances (read-only). They read the same
 index,
  but they did not return the same #hits for a particular query
 
  Log is below, but to summarize: first server always returns 576 hits, the
  second server returns: 440, 440, 576, 576...
 
  These are just few seconds apart. Load balancer directed requests to both
  servers. Both servers report the same numHits for other queries. I
 checked
  that nothing re-opened index, there was no errorthis is SOLR 4.0 (we
  should update, I know), running CentOS, the index lives on a RAID5
 mounted
  volume, both instances just read it (the index wasn't updated while these
  searches happened).
 
  Anybody has a pointer, I can't really understand it. Can it be a bug?
 
  Thanks,
 
roman
 
 
 
  If you look at the log below, you will see 9002 instance always returns
 576
  hits, but 9003 instance is returning 440, 440, 65, 576
 
 
  -bash-4.1$ grep -a -e 'Jones,+Christine+year:1990-2100'
  ./perpetuum/live-9002/solr-logging-0.log | grep -m 5 '2013-11-06 13:'
  2013-11-06 13:10:47 INFO org.apache.solr.core.SolrCore execute
  [collection1] webapp=/solr path=/select
 
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=400q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
  hits=576 status=0 QTime=155
  2013-11-06 13:17:05 INFO org.apache.solr.core.SolrCore execute
  [collection1] webapp=/solr path=/select
 
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=trueq=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
  hits=576 status=0 QTime=89
  2013-11-06 13:17:06 INFO org.apache.solr.core.SolrCore execute
  [collection1] webapp=/solr path=/select
 
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=200q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
  hits=576 status=0 QTime=87
  2013-11-06 13:21:50 INFO org.apache.solr.core.SolrCore execute
  [collection1] webapp=/solr path=/select
 
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=200q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
  hits=576 status=0 QTime=86
  2013-11-06 13:21:51 INFO org.apache.solr.core.SolrCore execute
  [collection1] webapp=/solr path=/select
 
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=400q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
  hits=576 status=0 QTime=87
 
 
  -bash-4.1$ grep -a -e 'Jones,+Christine+year:1990-2100'
  ./perpetuum/live-9003/solr-logging-0.log | grep -m 5 '2013-11-06 13:'
  2013-11-06 13:10:46 INFO org.apache.solr.core.SolrCore execute
  [collection1] webapp=/solr path=/select
 
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=trueq=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt

Re: Inconsistent number of hits returned by two solr instances (from the same index!)

2013-11-07 Thread Roman Chyla
Thanks Michael, haven't tried that yet.

Does anybody have suggestions on what might be the problem there? SOLR cache?
Disk I/O? Something else?

--roman


On Wed, Nov 6, 2013 at 9:41 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Wow, that's pretty weird. Have you tried turning logging down to debug and
 seeing if anything interesting shakes out?

 Michael Della Bitta

 Applications Developer

 o: +1 646 532 3062  | c: +1 917 477 7906

 appinions inc.

 “The Science of Influence Marketing”

 18 East 41st Street

 New York, NY 10017

 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 
 w: appinions.com http://www.appinions.com/


 On Wed, Nov 6, 2013 at 6:40 PM, Roman Chyla roman.ch...@gmail.com wrote:

  No, and I should add that this query was not against shards, just a one
  single index (and we dont use timeouts).
 
  --roman
 
 
  On Wed, Nov 6, 2013 at 5:28 PM, Michael Della Bitta 
  michael.della.bi...@appinions.com wrote:
 
   Does the header in the response indicate you're getting partialResults?
  
  
 
 http://help.websolr.com/kb/common-problems/why-am-i-getting-partial-results
  
   Michael Della Bitta
  
   Applications Developer
  
   o: +1 646 532 3062  | c: +1 917 477 7906
  
   appinions inc.
  
   “The Science of Influence Marketing”
  
   18 East 41st Street
  
   New York, NY 10017
  
   t: @appinions https://twitter.com/Appinions | g+:
   plus.google.com/appinions
  
 
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
   
   w: appinions.com http://www.appinions.com/
  
  
   On Wed, Nov 6, 2013 at 4:23 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
Hello,
   
We have two solr searchers/instances (read-only). They read the same
   index,
but they did not return the same #hits for a particular query
   
Log is below, but to summarize: first server always returns 576 hits,
  the
second server returns: 440, 440, 576, 576...
   
These are just few seconds apart. Load balancer directed requests to
  both
servers. Both servers report the same numHits for other queries. I
   checked
that nothing re-opened index, there was no errorthis is SOLR 4.0
  (we
should update, I know), running CentOS, the index lives on a RAID5
   mounted
volume, both instances just read it (the index wasn't updated while
  these
searches happened).
   
Anybody has a pointer, I can't really understand it. Can it be a bug?
   
Thanks,
   
  roman
   
   
   
If you look at the log below, you will see 9002 instance always
 returns
   576
hits, but 9003 instance is returning 440, 440, 65, 576
   
   
-bash-4.1$ grep -a -e 'Jones,+Christine+year:1990-2100'
./perpetuum/live-9002/solr-logging-0.log | grep -m 5 '2013-11-06 13:'
2013-11-06 13:10:47 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
   
   
  
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=400q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=155
2013-11-06 13:17:05 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
   
   
  
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=trueq=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=89
2013-11-06 13:17:06 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
   
   
  
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=200q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=87
2013-11-06 13:21:50 INFO org.apache.solr.core.SolrCore execute
[collection1] webapp=/solr path=/select
   
   
  
 
 params={hl.requireFieldMatch=truefl=bibcode,pubdate,keyword,author,property,abstract,bibstem,citation_count,pub,[citations],volume,database,aff,grants,year,id,title,identifier,issue,page,doisort=citation_count+descindent=truestart=200q=author:Jones,+Christine+year:1990-2100hl.usePhraseHighlighter=truehl.maxAnalyzedChars=15wt=jsonfq=database:astronomyrows=200}
hits=576 status=0 QTime=86
2013-11-06 13:21:51

building custom cache - using lucene docids

2013-11-22 Thread Roman Chyla
Hi,
docids are 'ephemeral', but i'd still like to build a search cache with
them (they allow for the fastest joins).

i'm seeing docids keep changing with updates (especially, in the last index
segment) - as per
https://issues.apache.org/jira/browse/LUCENE-2897

That would be fine, because i could build the cache from a diff (of the index
state) + reading the latest index segment in its entirety. But can I assume
that docids in other segments (other than the last one) will be relatively
stable? (ie. when an old doc is deleted, the docid is marked as removed;
updating a doc = delete the old one & create a new docid)?

thanks

roman


Re: building custom cache - using lucene docids

2013-11-23 Thread Roman Chyla
Hi Erick,
Many thanks for the info. An additional question:

Do i understand you correctly that when two segments get merged, the docids
(of the original segments) remain the same?

(unless, perhaps, in a situation where they were merged with the last index
segment, which was opened for writing and where the docids could have
suddenly changed in a commit just before the merge)

Yes, you guessed right that I am putting my code into the custom cache - so
it gets notified on index changes. I don't know yet how, but I think I can
find a way to the currently active, opened (last) index segment, which is
actively updated (as opposed to just being merged) -- so my definition of
'not the last one' is: where docids don't change. I'd be grateful if someone
could spot any problem with such an assumption.

roman




On Sat, Nov 23, 2013 at 7:39 PM, Erick Erickson erickerick...@gmail.comwrote:

 bq: But can I assume
 that docids in other segments (other than the last one) will be relatively
 stable?

 Kinda. Maybe. Maybe not. It depends on how you define other than the
 last one.

 The key is that the internal doc IDs may change when segments are
 merged. And old segments get merged. Doc IDs will _never_ change
 in a segment once it's closed (although as you note they may be
 marked as deleted). But that segment may be written to a new segment
 when merging and the internal ID for a given document in the new
 segment bears no relationship to internal ID in the old segment.

 BTW, I think you only really care when opening a new searchers. There is
 a UserCache (see solrconfig.xml) that gets notified when a new searcher
 is being opened to give it an opportunity to refresh itself, is that
 useful?

 As long as a searcher is open, it's guaranteed that nothing is changing.
 Hard commits with openSearcher=false don't open new searchers, which
 is why changes aren't visible until a softCommit or a hard commit with
 openSearcher=true despite the fact that the segments are closed.

 FWIW,
 Erick

 Best
 Erick



 On Sat, Nov 23, 2013 at 12:40 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi,
  docids are 'ephemeral', but i'd still like to build a search cache with
  them (they allow for the fastest joins).
 
  i'm seeing docids keep changing with updates (especially, in the last
 index
  segment) - as per
  https://issues.apache.org/jira/browse/LUCENE-2897
 
  That would be fine, because i could build the cache from diff (of index
  state) + reading the latest index segment in its entirety. But can I
 assume
  that docids in other segments (other than the last one) will be
 relatively
  stable? (ie. when an old doc is deleted, the docid is marked as removed;
  update doc = delete old  create a new docid)?
 
  thanks
 
  roman
 



Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
On Sun, Nov 24, 2013 at 8:31 AM, Erick Erickson erickerick...@gmail.comwrote:

 bq: Do i understand you correctly that when two segmets get merged, the
 docids
 (of the original segments) remain the same?

 The original segments are unchanged, segments are _never_ changed after
 they're closed. But they'll be thrown away. Say you have segment1 and
 segment2 that get merged into segment3. As soon as the last searcher
 that is looking at segment1 and segment2 is closed, those two segments
 will be deleted from your disk.

 But for any given doc, the docid in segment3 will very likely be different
 than it was in segment1 or 2.


i'm trying to figure this out - i'll have to dig, i suppose. for example,
if the docbase (the docid offset per searcher) was stored together with the
index segment, that would be an indication of 'relative stability of docids'



 I think you're reading too much into LUCENE-2897. I'm pretty sure the
 segment in question is not available to you anyway before this rewrite is
 done,
 but freely admit I don't know much about it.


i've done tests, committing and overwriting a document, and saw (SOLR 4.0)
that docids are being recycled. I deleted 2 docs, then added a new document
and guess what: the new document had the docid of the previously deleted
document (but different fields).

That was new to me, so I searched and found LUCENE-2897, which seemed to
explain that behaviour.



 You're probably going to get into the whole PerSegment family of
 operations,
 which is something I'm not all that familiar with so I'll leave
 explanations
 to others.


Thank you, it is useful to get insights from various sides,

  roman



 On Sat, Nov 23, 2013 at 8:22 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi Erick,
  Many thanks for the info. An additional question:
 
  Do i understand you correctly that when two segmets get merged, the
 docids
  (of the original segments) remain the same?
 
  (unless, perhaps in situation, they were merged using the last index
  segment which was opened for writing and where the docids could have
  suddenly changed in a commit just before the merge)
 
  Yes, you guessed right that I am putting my code into the custom cache -
 so
  it gets notified on index changes. I don't know yet how, but I think I
 can
  find the way to the current active, opened (last) index segment. Which is
  actively updated (as opposed to just being merged) -- so my definition of
  'not last ones' is: where docids don't change. I'd be grateful if someone
  could spot any problem with such assumption.
 
  roman
 
 
 
 
  On Sat, Nov 23, 2013 at 7:39 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   bq: But can I assume
   that docids in other segments (other than the last one) will be
  relatively
   stable?
  
   Kinda. Maybe. Maybe not. It depends on how you define other than the
   last one.
  
   The key is that the internal doc IDs may change when segments are
   merged. And old segments get merged. Doc IDs will _never_ change
   in a segment once it's closed (although as you note they may be
   marked as deleted). But that segment may be written to a new segment
   when merging and the internal ID for a given document in the new
   segment bears no relationship to internal ID in the old segment.
  
   BTW, I think you only really care when opening a new searchers. There
 is
   a UserCache (see solrconfig.xml) that gets notified when a new searcher
   is being opened to give it an opportunity to refresh itself, is that
   useful?
  
   As long as a searcher is open, it's guaranteed that nothing is
 changing.
   Hard commits with openSearcher=false don't open new searchers, which
   is why changes aren't visible until a softCommit or a hard commit with
   openSearcher=true despite the fact that the segments are closed.
  
   FWIW,
   Erick
  
   Best
   Erick
  
  
  
   On Sat, Nov 23, 2013 at 12:40 AM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Hi,
docids are 'ephemeral', but i'd still like to build a search cache
 with
them (they allow for the fastest joins).
   
i'm seeing docids keep changing with updates (especially, in the last
   index
segment) - as per
https://issues.apache.org/jira/browse/LUCENE-2897
   
That would be fine, because i could build the cache from diff (of
 index
state) + reading the latest index segment in its entirety. But can I
   assume
that docids in other segments (other than the last one) will be
   relatively
stable? (ie. when an old doc is deleted, the docid is marked as
  removed;
update doc = delete old  create a new docid)?
   
thanks
   
roman
   
  
 



Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
On Sun, Nov 24, 2013 at 10:44 AM, Jack Krupansky j...@basetechnology.comwrote:

 We should probably talk about internal Lucene document IDs and
 external or rebased Lucene document IDs. The internal document IDs are
 always per-segment and never, ever change for that closed segment. But...
 the application would not normally see these IDs. Usually the externally
 visible Lucene document IDs have been rebased to add the sum total count
 of documents (both existing and deleted) of all preceding segments to the
 document IDs of a given segment, producing a global (across the full
 index of all segments) Lucene document ID.

 So, if you have those three segments, with deleted documents in the first
 two segments, and then merge those first two segments, the
 externally-visible Lucene document IDs for the third segment will suddenly
 all be different, shifted lower by the number of deleted documents that
 were just merged away, even though nothing changed in the third segment
 itself.


That's right, and I'm starting to think that if i keep the segment id and
the original offset, i don't need to rebuild that part of the cache,
because it has not been rebased (but I can always update the deleted docs).
It seems simple, so I suspect there is a catch somewhere, but if it
works, that could potentially speed up any cache building.

Do you have information on where the docbase of each segment is stored? Or
which java class I should start my exploration from? [it is a somewhat
sprawling complex, so I'm a bit lost :)]



 Maybe these should be called local (to the segment) Lucene document IDs
 and global (across all segment) Lucene document IDs. Or, maybe internal
 vs. external is good enough.

 In short, it is completely safe to use and save Lucene document IDs, but
 only as long as no merging of segments is performed. Even one tiny merge
 and all subsequent saved document IDs are invalidated. Be careful with your
 merge policy - normally merges are happening in the background,
 automatically.


my tests, as per the previous email, showed that the last segment's docids are
not that stable. I don't know if it matters that I used the RAMDirectory
for the test, but the docids were being 'recycled' - the deleted docs were
in the previous segment, then suddenly their docids were inside newly
added documents (so maybe solr/lucene is not counting deleted docs if they
are at the end of a segment...?) i don't know. i'll need to explore the
index segments to understand what was going on there, thanks for any
possible pointers
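for the record, the test was along these lines (a stripped-down sketch using Lucene 4.0
APIs, not the exact test I ran; whether a docid actually gets reused will depend on
flush/merge behaviour):

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class DocIdRecyclingSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
    for (int i = 0; i < 3; i++) w.addDocument(doc("doc" + i));
    w.commit();
    dump(dir); // remember which docid each id landed on
    w.deleteDocuments(new Term("id", "doc1"));
    w.deleteDocuments(new Term("id", "doc2"));
    w.addDocument(doc("docNEW"));
    w.commit();
    dump(dir); // compare: did docNEW end up on a previously used docid?
    w.close();
  }

  static Document doc(String id) {
    Document d = new Document();
    d.add(new StringField("id", id, Field.Store.YES));
    return d;
  }

  static void dump(RAMDirectory dir) throws Exception {
    DirectoryReader r = DirectoryReader.open(dir);
    for (int i = 0; i < r.maxDoc(); i++) {
      System.out.println(i + " -> " + r.document(i).get("id")); // includes deleted slots
    }
    r.close();
  }
}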


  roman





 -- Jack Krupansky

 -Original Message- From: Erick Erickson
 Sent: Sunday, November 24, 2013 8:31 AM
 To: solr-user@lucene.apache.org
 Subject: Re: building custom cache - using lucene docids


 bq: Do i understand you correctly that when two segmets get merged, the
 docids
 (of the original segments) remain the same?

 The original segments are unchanged, segments are _never_ changed after
 they're closed. But they'll be thrown away. Say you have segment1 and
 segment2 that get merged into segment3. As soon as the last searcher
 that is looking at segment1 and segment2 is closed, those two segments
 will be deleted from your disk.

 But for any given doc, the docid in segment3 will very likely be different
 than it was in segment1 or 2.

 I think you're reading too much into LUCENE-2897. I'm pretty sure the
 segment in question is not available to you anyway before this rewrite is
 done,
 but freely admit I don't know much about it.

 You're probably going to get into the whole PerSegment family of
 operations,
 which is something I'm not all that familiar with so I'll leave
 explanations
 to others.


 On Sat, Nov 23, 2013 at 8:22 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi Erick,
 Many thanks for the info. An additional question:

 Do i understand you correctly that when two segmets get merged, the docids
 (of the original segments) remain the same?

 (unless, perhaps in situation, they were merged using the last index
 segment which was opened for writing and where the docids could have
 suddenly changed in a commit just before the merge)

 Yes, you guessed right that I am putting my code into the custom cache -
 so
 it gets notified on index changes. I don't know yet how, but I think I can
 find the way to the current active, opened (last) index segment. Which is
 actively updated (as opposed to just being merged) -- so my definition of
 'not last ones' is: where docids don't change. I'd be grateful if someone
 could spot any problem with such assumption.

 roman




 On Sat, Nov 23, 2013 at 7:39 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  bq: But can I assume
  that docids in other segments (other than the last one) will be
 relatively
  stable?
 
  Kinda. Maybe. Maybe not. It depends on how you define other than the
  last one.
 
  The key is that the internal doc IDs may change when segments are
  merged. And old segments get merged. Doc IDs will _never_ change

Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
On Mon, Nov 25, 2013 at 12:54 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Roman,

 I don't fully understand your question. After segment is flushed it's never
 changed, hence segment-local docids are always the same. Due to merge
 segment can gone, its' docs become new ones in another segment.  This is
 true for 'global' (Solr-style) docnums, which can flip after merge is
 happened in the middle of the segments' chain.
 As well you are saying about segmented cache I can propose you to look at
 CachingWrapperFilter and NoOpRegenerator as a pattern for such data
 structures.


Thanks Mikhail, the CWF confirms that the idea of regenerating just part of
the cache is doable. The CacheRegenerators, on the other hand, make no
sense to me - they are not given any 'signals', so they don't know if
they are in the middle of some regeneration or not, and they should not
keep state (of the previous index), as they can be shared by threads that
build the cache.

Best,

  roman




 On Sat, Nov 23, 2013 at 9:40 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi,
  docids are 'ephemeral', but i'd still like to build a search cache with
  them (they allow for the fastest joins).
 
  i'm seeing docids keep changing with updates (especially, in the last
 index
  segment) - as per
  https://issues.apache.org/jira/browse/LUCENE-2897
 
  That would be fine, because i could build the cache from diff (of index
  state) + reading the latest index segment in its entirety. But can I
 assume
  that docids in other segments (other than the last one) will be
 relatively
  stable? (ie. when an old doc is deleted, the docid is marked as removed;
  update doc = delete old  create a new docid)?
 
  thanks
 
  roman
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
OK, I've spent some time reading the solr/lucene 4.x classes, and this is
my understanding (feel free to correct me ;-))

DirectoryReader holds the opened segments -- each segment has its own
reader, and the BaseCompositeReader (or classes extending it) stores the
offsets per segment; eg. [0, 5, 22] - meaning there are 2 segments,
with 5 and 17 docs respectively

The segments are listed in the segments_N file,
http://lucene.apache.org/core/3_0_3/fileformats.html#Segments
File

So theoretically, the order of segments could change when a merge happens -
yet every SegmentReader is identified by a unique name, and this name doesn't
change unless the segment itself changed (ie. docs were deleted, or it got
more docs) - so it is possible to rely on this name to know what has not
changed.
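a small illustration of where this is exposed in the 4.x API (the cast to SegmentReader is
the only part I'm not sure is safe for every kind of leaf reader):

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SegmentReader;

public class SegmentInspector {
  /** Prints, per segment: its name, its docid offset (docBase) and its live/total doc counts. */
  public static void dump(DirectoryReader reader) {
    for (AtomicReaderContext leaf : reader.leaves()) {
      AtomicReader ar = leaf.reader();
      String name = (ar instanceof SegmentReader)
          ? ((SegmentReader) ar).getSegmentName() : ar.toString();
      System.out.println(name + " docBase=" + leaf.docBase
          + " numDocs=" + ar.numDocs() + " maxDoc=" + ar.maxDoc());
    }
  }
}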

the name comes from SegmentInfo (check its toString method) -- SegmentInfo
has an equals() method that considers readers with the same name and the
same dir as equal (which is useful to know - two readers, one with deletes,
one without, are equal)

Lucene's FieldCache itself is rather complex, but it shows there is a very
clever mechanism (a few, actually!) -- a class can register a listener that
will be called whenever an index segment is being closed (this could be
used to invalidate portions of a cache); the relevant classes are:
SegmentReader.CoreClosedListener, IndexReader.ReaderClosedListener

But Lucene is using this mechanism only to purge the cache - so
effectively, every commit triggers a cache rebuild. This is the interesting
bit: lots of work could be spared if segment data were reused (but
admittedly only sometimes - for data that was fully read into memory; for
anything else, such as terms, the cache reads only some values and fetches
the rest from the index - so Lucene must close the reader and
rebuild the cache on every commit; but that is not my case, as I am going to
copy values from an index and store them in memory...)

the weird 'recycling' of docids I've observed can probably be explained
by the fact that the index reader contains both segments and near-realtime
readers (but I'm not sure about this)

To conclude: it is possible to build a cache that updates itself (with only
the changes committed since the last build) - this will have an impact on
how fast a new searcher is ready to serve requests

HTH somebody else too :)

  roman



On Mon, Nov 25, 2013 at 7:54 PM, Roman Chyla roman.ch...@gmail.com wrote:




 On Mon, Nov 25, 2013 at 12:54 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 Roman,

 I don't fully understand your question. After segment is flushed it's
 never
 changed, hence segment-local docids are always the same. Due to merge
 segment can gone, its' docs become new ones in another segment.  This is
 true for 'global' (Solr-style) docnums, which can flip after merge is
 happened in the middle of the segments' chain.
 As well you are saying about segmented cache I can propose you to look at
 CachingWrapperFilter and NoOpRegenerator as a pattern for such data
 structures.


 Thanks Mikhail, the CWF confirms that the idea of regenerating just part
 of the cache is doable. The CacheRegenerators, on the other hand, make no
 sense to me - and they are not given any 'signals', so they don't know if
 they are in the middle of some regeneration or not, and they should not
 keep a state (of previous index) - as they can be shared by threads that
 build the cache

 Best,

   roman




 On Sat, Nov 23, 2013 at 9:40 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi,
  docids are 'ephemeral', but i'd still like to build a search cache with
  them (they allow for the fastest joins).
 
  i'm seeing docids keep changing with updates (especially, in the last
 index
  segment) - as per
  https://issues.apache.org/jira/browse/LUCENE-2897
 
  That would be fine, because i could build the cache from diff (of index
  state) + reading the latest index segment in its entirety. But can I
 assume
  that docids in other segments (other than the last one) will be
 relatively
  stable? (ie. when an old doc is deleted, the docid is marked as removed;
  update doc = delete old  create a new docid)?
 
  thanks
 
  roman
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com





Caches contain deleted docs (?)

2013-11-27 Thread Roman Chyla
Hi,
I'd like to check - there is something I don't understand about the cache - and
I don't know if it is a bug or a feature

the following calls return a cache

FieldCache.DEFAULT.getTerms(reader, idField);
FieldCache.DEFAULT.getInts(reader, idField, false);


the resulting arrays *will* contain entries for deleted docs, so to filter
them out, one has to manually check livedocs. Is this the expected
behaviour? I don't understand why the cache would bother to load data
for deleted docs. This is on SOLR4.0
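
For completeness, the manual filtering looks roughly like this (a sketch
against the Lucene 4.0 API, where getInts() still returns a plain int[];
class and method names are mine):

  import java.io.IOException;
  import org.apache.lucene.index.AtomicReader;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.util.Bits;

  public class LiveValuesReader {

    public static void readLiveInts(AtomicReader reader, String idField) throws IOException {
      final int[] ids = FieldCache.DEFAULT.getInts(reader, idField, false);
      final Bits liveDocs = reader.getLiveDocs(); // null when the segment has no deletions
      for (int docid = 0; docid < reader.maxDoc(); docid++) {
        if (liveDocs != null && !liveDocs.get(docid)) {
          continue; // deleted doc - the cache still holds a value in this slot
        }
        int value = ids[docid];
        // ... use the value ...
      }
    }
  }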

Thanks!

  roman


Re: Caches contain deleted docs (?)

2013-11-27 Thread Roman Chyla
I understand that changes would be expensive, but shouldn't the cache
simply skip the deleted docs? In the same way as the cache for multivalued
fields (that accepts livedocs bits).
Thanks,

  roman


On Wed, Nov 27, 2013 at 6:26 PM, Erick Erickson erickerick...@gmail.comwrote:

 Yep, it's expected. Segments are write-once. It's been
 a long standing design that deleted data will be
 reclaimed on segment merge, but not before. It's
 pretty expensive to change the terms loaded on the
 fly to respect deleted document's removed data.

 Best,
 Erick


 On Wed, Nov 27, 2013 at 4:07 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi,
  I'd like to check - there is something I don't understand about cache -
 and
  I don't know if it is a bug, or feature
 
  the following calls return a cache
 
  FieldCache.DEFAULT.getTerms(reader, idField);
  FieldCache.DEFAULT.getInts(reader, idField, false);
 
 
  the resulting arrays *will* contain entries for deleted docs, so to
 filter
  them out, one has to manually check livedocs. Is this the expected
  behaviour? I don't understand why the cache would be bothering to load
 data
  for deleted docs. This is on SOLR4.0
 
  Thanks!
 
roman
 



Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Roman Chyla
Isaac, is there an easy way to recognize this problem? We also index
synonym tokens in the same position (like you do, and I'm sure that our
positions are set correctly). I could test whether explicitly declaring the
default similarity factory in schema.xml has any effect (before/after
reindexing).

--roman


On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi Robert and Manuel.

 The DefaultSimilarity indeed sets discountOverlap to true by default.
 BUT, the *factory*, aka DefaultSimilarityFactory, when called by
 IndexSchema (the getSimilarity method), explicitly sets this value to the
 value of its corresponding class member.
  This class member is initialized to FALSE when the instance is created
  (like every boolean variable in the world). It should be set when the init
  method is called. If the parameter is not set in schema.xml, the default is
  true.

  Everything seems to be alright, but the issue is that the init method is NOT
  called if the similarity is not *explicitly* declared in schema.xml. In
  that case the discountOverlaps member (of the factory class) remains FALSE,
  and getSimilarity explicitly calls setDiscountOverlaps with a value of FALSE.

 This is very easy to reproduce and debug.


 On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote:

  no, its turned on by default in the default similarity.
 
  as i said, all that is necessary is to fix your analyzer to emit the
  proper position increments.
 
  On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
  manuel.lenorm...@gmail.com wrote:
    In order to set discountOverlaps to true you must have added the
    <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml,
    which is commented out by default!
  
   As by default this param is false, the above situation is expected with
   correct positioning, as said.
  
   In order to fix the field norms you'd have to reindex with the
 similarity
   class which initializes the param to true.
  
   Cheers,
   Manu
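
In other words, until that is fixed, the workaround is to declare the
similarity explicitly in schema.xml so that init() actually runs - something
along these lines (a sketch; adjust to your own schema):

  <!-- schema.xml: declare the factory explicitly so its init() is called -->
  <similarity class="solr.DefaultSimilarityFactory">
    <bool name="discountOverlaps">true</bool>
  </similarity>

And, as Manuel notes above, the field norms only change after reindexing.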
 



Re: Commit Issue in Solr 3.4

2014-02-08 Thread Roman Chyla
I would be curious what the cause is. Samarth says that it worked for over
a year /and supposedly docs were being added all the time/. Did the index
grow considerably in the last period? Perhaps he could attach visualvm
while it is in the 'black hole' state to see what is actually going on. I
don't know if the instance is also used for searching, but if it's only
indexing, maybe just shorter commit intervals would alleviate the problem.
To add context, our indexer is configured with a 16gb heap, on a machine with
64gb ram, but a busy one, so sometimes there is no cache to spare for the OS.
The index is 300gb (out of which 140gb stored values), and it is working just
'fine' - 30 doc/s on average, but our docs are large /0.5mb on avg/ and
fetched from two databases, so the slowness is outside solr. I didn't see
big improvements with a bigger heap, but I don't remember exact numbers. This
is solr4.

Roman
On 8 Feb 2014 12:23, Shawn Heisey s...@elyograg.org wrote:

 On 2/8/2014 1:40 AM, samarth s wrote:
  Yes it is amazon ec2 indeed.
 
  To expand on that,
  This solr deployment was working fine, handling the same load, on a 34 GB
  instance on ebs storage for quite some time. To reduce the time taken by
 a
  commit, I shifted this to a 30 GB SSD instance. It performed better in
  writes and commits for sure. But, since the last week I started facing
 this
  problem of infinite back to back commits. Not being able to resolve
 this, I
  have finally switched back to a 34 GB machine with ebs storage, and now
 the
  commits are working fine, though slow.

 The extra 4GB of RAM is almost guaranteed to be the difference.  If your
 index continues to grow, you'll probably be having problems very soon
 even with 34GB of RAM.  If you could put it on a box with 128 to 256GB
 of RAM, you'd likely see your performance increase dramatically.

 Can you share your solrconfig.xml file?  I may be able to confirm a
 couple of things I suspect, and depending on what's there, may be able
 to offer some ideas to help a little bit.  It's best if you use a file
 sharing site like dropbox - the list doesn't deal with attachments very
 well.  Sometimes they work, but most of the time they don't.

 I will reiterate my main point -- you really need a LOT more memory.
 Another option is to shard your index across multiple servers.  This
 doesn't actually reduce the TOTAL memory requirement, but it is
 sometimes easier to get management to agree to buy more servers than it
 is to get them to agree to buy really large servers.  It's a paradox
 that doesn't make any sense to me, but I've seen it over and over.

 Thanks,
 Shawn




Re: Commit Issue in Solr 3.4

2014-02-08 Thread Roman Chyla
Thanks for the links. I think it would be worth getting more detailed info,
because it could be the performance threshold, or it could be something else
/such as an updated java version, or something loosely related to ram, e.g.
what is held in memory before the commit, what is cached, leaked custom query
objects holding on to some big object, etc./. Btw if I study the graph, I see
that there *are* warning signs. That's the point of testing/measuring after
all, IMHO.

--roman
On 8 Feb 2014 13:51, Shawn Heisey s...@elyograg.org wrote:

 On 2/8/2014 11:02 AM, Roman Chyla wrote:
  I would be curious what the cause is. Samarth says that it worked for
 over
  a year /and supposedly docs were being added all the time/. Did the index
  grew considerably in the last period? Perhaps he could attach visualvm
  while it is in the 'black hole' state to see what is actually going on. I
  don't know if the instance is used also for searching, but if its only
  indexing, maybe just shorter commit intervals would alleviate the
 problem.
  To add context, our indexer is configured with 16gb heap, on machine with
  64gb ram, but busy one, so sometimes there is no cache to spare for os.
 The
  index is 300gb (out of which 140gb stored values), and it is working just
  'fine' - 30doc/s on average, but our docs are large /0.5mb on avg/ and
  fetched from two databases, so the slowness is outside solr. I didnt see
  big improvements with bigger heap, but I don't remember exact numbers.
 This
  is solr4.

 For this discussion, refer to this image, or the Google Books link where
 I originally found it:

 https://dl.dropboxusercontent.com/u/97770508/performance-dropoff-graph.png

  http://books.google.com/books?id=dUiNGYCiWg0C&pg=PA33#v=onepage&q&f=false

 Computer systems have had a long history of performance curves like
 this.  Everything goes really well, possibly for a really long time,
 until you cross some threshold where a resource cannot keep up with the
 demands being placed on it.  That threshold is usually something you
 can't calculate in advance.  Once it is crossed, even by a tiny amount,
 performance drops VERY quickly.

 I do recommend that people closely analyze their GC characteristics, but
 jconsole, jvisualvm, and other tools like that are actually not very
 good at this task.  You can only get summary info -- how many GCs
 occurred and total amount of time spent doing GC, often with a useless
 granularity -- jconsole reports the time in minutes on a system that has
 been running for any length of time.

 I *was* having occasional super-long GC pauses (15 seconds or more), but
 I did not know it, even though I had religiously looked at GC info in
 jconsole and jstat.  I discovered the problem indirectly, and had to
 find additional tools to quantify it.  After discovering it, I tuned my
 garbage collection and have not had the problem since.

 If you have detailed GC logs enabled, this is a good free tool for
 offline analysis:

 https://code.google.com/p/gclogviewer/

 I have also had good results with this free tool, but it requires a
 little more work to set up:

 http://www.azulsystems.com/jHiccup

 Azul Systems has an alternate Java implementation for Linux that
 virtually eliminates GC pauses, but it isn't free.  I do not have any
 information about how much it costs.  We found our own solution, but for
 those who can throw money at the problem, I've heard good things about it.

 Thanks,
 Shawn
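
For reference, the detailed GC logs mentioned above are typically enabled
with JVM options along these lines (the log path is just a placeholder); the
resulting file is what a tool like gclogviewer analyzes:

  -verbose:gc
  -Xloggc:/var/log/solr/gc.log
  -XX:+PrintGCDetails
  -XX:+PrintGCTimeStamps
  -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime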




Re: Solr4 performance

2014-02-12 Thread Roman Chyla
And perhaps one other, but very pertinent, recommendation is: allocate only
as little heap as is necessary. By allocating more, you are working against
the OS caching. To know how much is enough is a bit tricky, though.

Best,

  roman


On Wed, Feb 12, 2014 at 2:56 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/12/2014 12:07 PM, Greg Walters wrote:

 Take a look at http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-
 on-64bit.html as it's a pretty decent explanation of memory mapped
 files. I don't believe that the default configuration for solr is to use
 MMapDirectory but even if it does my understanding is that the entire file
 won't be forcibly cached by solr. The OS's filesystem cache should control
 what's actually in ram and the eviction process will depend on the OS.


 I only have a little bit to add.  Here's the first thing that Uwe's blog
 post (linked above) says:

  Since version 3.1, *Apache Lucene* and *Solr* use MMapDirectory by default
 on 64bit Windows and Solaris systems; since version 3.3 also for 64bit
 Linux systems.

 The default in Solr 4.x is NRTCachingDirectory, which uses MMapDirectory
 by default under the hood.

 A summary about all this that should be relevant to the original question:

 It's the *operating system* that handles memory mapping, including any
 caching that happens.  Assuming that you don't have a badly configured
 virtual machine setup, I'm fairly sure that only real memory gets used,
 never swap space on the disk.  If something else on the system makes a
 memory allocation, the operating system will instantly give up memory used
 for caching and mapping.  One of the strengths of mmap is that it can't
 exceed available resources unless it's used incorrectly.

 Thanks,
 Shawn




Re: APACHE SOLR: Pass a file as query parameter and then parse each line to form a criteria

2014-02-13 Thread Roman Chyla
Hi Rajeev,
You can take this:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201307.mbox/%3CCAEN8dyX_Am_v4f=5614eu35fnhb5h7dzkmkzdfwvrrm1xpq...@mail.gmail.com%3E

I haven't created the jira yet, but I have improved the plugin. Recently, I
have seen a use case of passing 90K identifiers /Over 1Mb/ as a filter
query - 3M docs were being filtered. It was blazing fast

Hth,

Roman
On 13 Feb 2014 00:12, rajeev.nadgauda rajeev.nadga...@leadenrich.com
wrote:

 Hi ,
 I am new to solr , i need help with the following

 PROBLEM: I have a huge file of 1 lines i want this to be an inclusion
 or
 exclusion in the query . i.e each line like ( line1 or line2 or ..)

 How can this be achieved in solr , is there a custom implementation that i
 would need to implement. Also will it help to implement a custom filter ?


 Thank You,
 Rajeev Nadgauda.





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/APACHE-SOLR-Pass-a-file-as-query-parameter-and-then-parse-each-line-to-form-a-criteria-tp4117066.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: filtering/faceting by a big list of IDs

2014-02-13 Thread Roman Chyla
Hi Tri,
Look at this:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201307.mbox/%3CCAEN8dyX_Am_v4f=5614eu35fnhb5h7dzkmkzdfwvrrm1xpq...@mail.gmail.com%3E
Roman
On 13 Feb 2014 03:39, Tri Cao tm...@me.com wrote:

 Hi Joel,

 Thanks a lot for the suggestion.

 After thinking more about this, I think I could skip the faceting count
 for now,
 and so just provide a filtering option without display how many items that
 would
 be there after filtering. After all, even Google Shopping product search
 doesn't
 display the facet counts :) Given that, I think the easiest way is to add
 a new
 PostFilter to the query.

 Thanks again,
 Tri

 On Feb 12, 2014, at 12:03 PM, Joel Bernstein joels...@gmail.com wrote:

 Tri,

 You will most likely need to implement a custom QParserPlugin to
 efficiently handle what you described. Inside of this QParserPlugin you
 could create the logic that would bring in your outside list of ID's and
 build a DocSet that could be applied to the fq and the facet.query. I
 haven't attempted to use a QParserPlugin with a facet.query, but in theory
 it would work.

 With the filter query you also have the option of implementing your Query
 as a PostFilter. PostFilter logic is applied at collect time so the logic
 needs to only be applied to the documents that match the query. In many
  cases this can be faster, especially when result sets are relatively small
 but the index is large.


 Joel Bernstein
 Search Engineer at Heliosearch


 On Wed, Feb 12, 2014 at 2:12 PM, Tri Cao tm...@me.com wrote:

  Hi all,

  I am running a Solr application and I would need to implement a feature
  that requires faceting and filtering on a large list of IDs. The IDs are
  stored outside of Solr and are specific to the current logged-on user. An
  example of this is the articles/tweets the user has read in the last few
  weeks. Note that the IDs here are the real document IDs and not Lucene
  internal docids.

  So the question is what would be the best way to implement this in Solr?
  The list could be as large as tens of thousands of IDs. The obvious way of
  rewriting the Solr query to add the ID list as facet.query and fq doesn't
  seem to be the best way because: a) the query would be very long, and b) it
  would surely exceed the default limit of 1024 Boolean clauses, and I am
  sure the limit is there for a reason.

  I had a similar problem before, but back then I was using Lucene directly
  and the way I solved it is to use a MultiTermQuery to retrieve the internal
  docids from the ID list and then apply the resulting DocSet to counting and
  filtering. It was working reasonably for lists of size ~10K, and with proper
  caching, it was working ok. My current application is so invested in Solr
  that going back to Lucene is not an option anymore.

  All advice/suggestions are welcome.

  Thanks,

  Tri
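
For reference, a rough sketch of the PostFilter approach Joel describes above,
against the Solr 4.x (4.2+) API - the class name, the "id" field and the
FieldCache-based lookup are only illustrative, not the only way to do it:

  import java.io.IOException;
  import java.util.Set;

  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.index.SortedDocValues;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.util.BytesRef;
  import org.apache.solr.search.DelegatingCollector;
  import org.apache.solr.search.ExtendedQueryBase;
  import org.apache.solr.search.PostFilter;

  public class IdListPostFilter extends ExtendedQueryBase implements PostFilter {

    private final Set<String> allowedIds; // the user-specific ids, loaded outside Solr

    public IdListPostFilter(Set<String> allowedIds) {
      this.allowedIds = allowedIds;
      setCache(false); // post filters must not be cached
      setCost(100);    // cost >= 100 tells Solr to run this at collect time
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
      return new DelegatingCollector() {
        private SortedDocValues idValues;
        private final BytesRef scratch = new BytesRef();

        @Override
        public void setNextReader(AtomicReaderContext context) throws IOException {
          super.setNextReader(context);
          // per-segment lookup of the external id for each matching doc
          idValues = FieldCache.DEFAULT.getTermsIndex(context.reader(), "id");
        }

        @Override
        public void collect(int doc) throws IOException {
          int ord = idValues.getOrd(doc);
          if (ord < 0) {
            return; // document has no id value
          }
          idValues.lookupOrd(ord, scratch);
          if (allowedIds.contains(scratch.utf8ToString())) {
            super.collect(doc); // only docs on the list reach the results/facets
          }
        }
      };
    }

    // a production version should also implement equals() and hashCode()
  }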




Re: w/10 ? [was: Partial Counts in SOLR]

2014-03-24 Thread Roman Chyla
perhaps useful, here is an open source implementation with near[digit]
support, incl. analysis of proximity tokens. When days become longer maybe
it will be packaged into a nice lib... :-)

https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/grammars/ADS.g
On 25 Mar 2014 00:14, Salman Akram salman.ak...@northbaysolutions.net
wrote:

 Basically we just created this syntax for the ease of users, otherwise on
 back end it uses W or N operators.


 On Tue, Mar 25, 2014 at 4:21 AM, Ahmet Arslan iori...@yahoo.com wrote:

  Hi,
 
  There is no w/int syntax in surround.
  /* Query language operators: OR, AND, NOT, W, N, (, ), ^, *, ?,  and
  comma */
 
  Ahmet
 
 
 
  On Monday, March 24, 2014 9:46 PM, T. Kuro Kurosaka k...@healthline.com
 
  wrote:
  On 3/19/14 5:13 PM, Otis Gospodnetic wrote: Hi,
  
   Guessing it's surround query parser's support for within backed by
 span
   queries.
  
   Otis
 
  You mean this?
  http://wiki.apache.org/solr/SurroundQueryParser
 
  I guess this parser needs improvement in documentation area.
  It doesn't explain or have an example of the w/int syntax at all.
  (Is this the infix notation of W?)
  An example would help explaining difference between W and N;
  some readers may not understand what ordered and unordered
  in this context mean.
 
 
  Kuro
 



 --
 Regards,

 Salman Akram



Re: What is the usage of solr.NumericPayloadTokenFilterFactory

2014-05-17 Thread Roman Chyla
Hi, What will replace spans, if spans are nuked ?
Roman
On 17 May 2014 09:15, Ahmet Arslan iori...@yahoo.com wrote:

 Hi,


 Payloads are used to store arbitrary data along with terms. You can
 influence score with these arbitrary data.
 See :
 http://sujitpal.blogspot.com.tr/2013/07/porting-payloads-to-solr4.html

 But remember that there is an ongoing work to nuke Spans.

 Ahmet



 On Saturday, May 17, 2014 8:24 AM, ienjreny ismaeel.enjr...@gmail.com
 wrote:
 Regarding your question: "That said, are you sure you want to be using
 the payload feature of Lucene?"

 I don't know, because I don't know what the benefits of this tokenizer are,
 and what Payload means here!


 On Sat, May 17, 2014 at 2:45 AM, Jack Krupansky-2 [via Lucene] 
 ml-node+s472066n4136467...@n3.nabble.com wrote:

  I do have basic coverage for that filter (and all other filters) and the
  parameter values in my e-book:
 
 
 
 http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
 
  That said, are you sure you want to be using the payload feature of
  Lucene?
 
  -- Jack Krupansky
 
  -Original Message-
  From: ienjreny
  Sent: Monday, May 12, 2014 12:51 PM
   To: [hidden email]
 
  Subject: What is the usage of solr.NumericPayloadTokenFilterFactory
 
  Dears:
  Can any body explain at easy way what is the benefits of
  solr.NumericPayloadTokenFilterFactory and what is acceptable values for
  typeMatch
 
  Thanks in advance
 
 
 
  --
  View this message in context:
 
 
 http://lucene.472066.n3.nabble.com/What-is-the-usage-of-solr-NumericPayloadTokenFilterFactory-tp4135326.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/What-is-the-usage-of-solr-NumericPayloadTokenFilterFactory-tp4135326p4136597.html

 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Anti-Pattern in lucent-join jar?

2014-12-04 Thread Roman Chyla
+1, additionally (as it follows from your observation) the query can get
out of sync with the index, if e.g. it was saved for later use and run
against a newly opened searcher

Roman
On 4 Dec 2014 10:51, Darin Amos dari...@gmail.com wrote:

 Hello All,

 I have been doing a lot of research in building some custom queries and I
 have been looking at the Lucene Join library as a reference. I noticed
 something that I believe could actually have a negative side effect.

 Specifically I was looking at the JoinUtil.createJoinQuery(…) method and
 within that method you see the following code:

 TermsWithScoreCollector termsWithScoreCollector =
 TermsWithScoreCollector.create(fromField,
 multipleValuesPerDocument, scoreMode);
 fromSearcher.search(fromQuery, termsWithScoreCollector);

  As you can see, when the JoinQuery is being built, the code is executing
  the query that it wraps with its own collector to collect all the scores.
  If I were to write a query parser using this library (which someone has
  done here), doesn't this reduce the benefit of the SOLR query cache? The
  wrapped query is being executed when the Join Query is being constructed,
  not when it is executed.

 Thanks

 Darin



Re: Anti-Pattern in lucent-join jar?

2014-12-05 Thread Roman Chyla
Hi Mikhail, I think you are right, it won't be a problem for SOLR, but it is
likely an antipattern inside a lucene component, because custom components
may create join queries, hold on to them and then execute them much later
against a different searcher. One approach would be to postpone term
collection until the query actually runs; I looked far and wide for an
appropriate place, but only found createWeight() - at least that gives
developers NO opportunity to shoot their feet! ;-)

Since it may serve as an inspiration to someone, here is a link:
https://github.com/romanchyla/montysolr/blob/master-next/contrib/adsabs/src/java/org/apache/lucene/search/SecondOrderQuery.java#L101
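
For illustration, the idea boiled down to a few lines (a rough sketch against
the Lucene 4.x API, not the actual SecondOrderQuery code; multipleValuesPerDocument
is hardcoded to true and equals()/hashCode() are omitted):

  import java.io.IOException;

  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Weight;
  import org.apache.lucene.search.join.JoinUtil;
  import org.apache.lucene.search.join.ScoreMode;

  public class DeferredJoinQuery extends Query {

    private final String fromField;
    private final String toField;
    private final Query fromQuery;
    private final ScoreMode scoreMode;

    public DeferredJoinQuery(String fromField, String toField, Query fromQuery, ScoreMode scoreMode) {
      this.fromField = fromField;
      this.toField = toField;
      this.fromQuery = fromQuery;
      this.scoreMode = scoreMode;
    }

    @Override
    public Weight createWeight(IndexSearcher searcher) throws IOException {
      // term collection happens here, against the searcher that will actually
      // execute the query, so a saved/cached instance cannot go stale
      Query join = JoinUtil.createJoinQuery(fromField, true, toField, fromQuery, searcher, scoreMode);
      return searcher.rewrite(join).createWeight(searcher);
    }

    @Override
    public String toString(String field) {
      return "deferredJoin(" + fromField + " -> " + toField + ", " + fromQuery.toString(field) + ")";
    }

    // note: a real implementation must also override equals() and hashCode(),
    // otherwise Solr's caches cannot tell two instances apart
  }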

roman

On Fri, Dec 5, 2014 at 4:52 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Thanks Roman! Let's expand it for the sake of completeness.
 Such issue is not possible in Solr, because caches are associated with the
 searcher. While you follow this design (see Solr userCache), and don't
 update what's cached once, there is no chance to shoot the foot.
 There were few caches inside of Lucene (old FieldCache,
 CachingWrapperFilter, ExternalFileField, etc), but they are properly mapped
 onto segment keys, hence it exclude such leakage across different
 searchers.

 On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla roman.ch...@gmail.com wrote:

  +1, additionally (as it follows from your observation) the query can get
  out of sync with the index, if eg it was saved for later use and ran
  against newly opened searcher
 
  Roman
  On 4 Dec 2014 10:51, Darin Amos dari...@gmail.com wrote:
 
   Hello All,
  
   I have been doing a lot of research in building some custom queries
 and I
   have been looking at the Lucene Join library as a reference. I noticed
   something that I believe could actually have a negative side effect.
  
   Specifically I was looking at the JoinUtil.createJoinQuery(…) method
 and
   within that method you see the following code:
  
   TermsWithScoreCollector termsWithScoreCollector =
   TermsWithScoreCollector.create(fromField,
   multipleValuesPerDocument, scoreMode);
   fromSearcher.search(fromQuery, termsWithScoreCollector);
  
   As you can see, when the JoinQuery is being built, the code is
 executing
   the query that is wraps with it’s own collector to collect all the
  scores.
   If I were to write a query parser using this library (which someone has
   done here), doesn’t this reduce the benefit of the SOLR query cache?
 The
   wrapped query is being executing when the Join Query is being
  constructed,
   not when it is executed.
  
   Thanks
  
   Darin
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: Anti-Pattern in lucent-join jar?

2014-12-05 Thread Roman Chyla
Not sure I understand. It is the searcher which executes the query, so how
would you 'convince' it to pass the query? First the Weight is created, then
the weight instance creates a scorer - you would have to change the API to do
the passing (or maybe not...?)
In my case, the relationships were across index segments, so I had to
collect them first - but in some other situations, when you only look at
the data inside one index segment, it _might_ be better to wait



On Fri, Dec 5, 2014 at 1:25 PM, Darin Amos dari...@gmail.com wrote:

 Couldn’t you just keep passing the wrapped query and searcher down to
 Weight.scorer()?

 This would allow you to wait until the query is executed to do term
 collection. If you want to protect against creating and executing the query
 with different searchers, you would have to make the query factory (or
 constructor) only visible to the query parser or parser plugin?

  I might not have followed you, this discussion challenges my understanding
  of Lucene and SOLR.

 Darin



  On Dec 5, 2014, at 12:47 PM, Roman Chyla roman.ch...@gmail.com wrote:
 
  Hi Mikhail, I think you are right, it won't be problem for SOLR, but it
 is
  likely an antipattern inside a lucene component. Because custom
 components
  may create join queries, hold to them and then execute much later
 against a
  different searcher. One approach would be to postpone term collection
 until
  the query actually runs, I looked far and wide for appropriate place, but
  only found createWeight() - but at least it does give developers NO
  opportunity to shoot their feet! ;-)
 
  Since it may serve as an inspiration to someone, here is a link:
 
 https://github.com/romanchyla/montysolr/blob/master-next/contrib/adsabs/src/java/org/apache/lucene/search/SecondOrderQuery.java#L101
 
  roman
 
  On Fri, Dec 5, 2014 at 4:52 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:
 
  Thanks Roman! Let's expand it for the sake of completeness.
  Such issue is not possible in Solr, because caches are associated with
 the
  searcher. While you follow this design (see Solr userCache), and don't
  update what's cached once, there is no chance to shoot the foot.
  There were few caches inside of Lucene (old FieldCache,
  CachingWrapperFilter, ExternalFileField, etc), but they are properly
 mapped
  onto segment keys, hence it exclude such leakage across different
  searchers.
 
  On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
  +1, additionally (as it follows from your observation) the query can
 get
  out of sync with the index, if eg it was saved for later use and ran
  against newly opened searcher
 
  Roman
  On 4 Dec 2014 10:51, Darin Amos dari...@gmail.com wrote:
 
  Hello All,
 
  I have been doing a lot of research in building some custom queries
  and I
  have been looking at the Lucene Join library as a reference. I noticed
  something that I believe could actually have a negative side effect.
 
  Specifically I was looking at the JoinUtil.createJoinQuery(…) method
  and
  within that method you see the following code:
 
 TermsWithScoreCollector termsWithScoreCollector =
 TermsWithScoreCollector.create(fromField,
  multipleValuesPerDocument, scoreMode);
 fromSearcher.search(fromQuery, termsWithScoreCollector);
 
  As you can see, when the JoinQuery is being built, the code is
  executing
  the query that is wraps with it’s own collector to collect all the
  scores.
  If I were to write a query parser using this library (which someone
 has
  done here), doesn’t this reduce the benefit of the SOLR query cache?
  The
  wrapped query is being executing when the Join Query is being
  constructed,
  not when it is executed.
 
  Thanks
 
  Darin
 
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com
 




Re: Queries not supported by Lucene Query Parser syntax

2015-01-01 Thread Roman Chyla
Hi Leonid,

I haven't looked into the solr qparser for a long time, but I think you should
be able to combine different query parsers in one query. Look at the
SolrQueryParser code; maybe now you can specify a custom query parser for
every clause (?), something like:

foo AND {!lucene}bar

I don't know, but it is worth exploring
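
If memory serves, the documented way to nest one parser inside another from
within the lucene/edismax parsers is the _query_ magic field, something like:

q=foo AND _query_:"{!dismax qf=title}bar"

(the clause inside the quotes is handed to whichever parser the local params
name; qf=title is only a placeholder)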

There is another implementation of a query language which I know allows
combining different query parsers in one (because I wrote it); there
the query goes this way:

edismax(dog cat AND lucene((foo AND bar)~3))

meaning: use edismax to build the main query, but let the lucene query parser
build the 3rd clause - the nested 'foo AND bar' (parsers are expressed as
function operators, so you can use any query parser that exists in SOLR)

it is here, https://issues.apache.org/jira/browse/LUCENE-5014, but that was
not reviewed/integrated either


So no, you are not always limited by the query parser - you can combine
them (in a more or less limited fashion). But yes, the query parsers limit
the expressiveness of your query language, though not what can be searched
(they will all produce a Query object).

Best,

  roman



On Thu, Jan 1, 2015 at 10:15 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Yes, you are always limited by the query parser syntax, but of course you
 can always write your own query parser as well.

 There is an open issue for an XML-based query parser that would give you
 greater control. but... it's not committed yet:
 https://issues.apache.org/jira/browse/SOLR-839

 -- Jack Krupansky

 On Thu, Jan 1, 2015 at 4:08 AM, Leonid Bolshinsky leonid...@gmail.com
 wrote:

  Hello,
 
  Are we always limited by the query parser syntax when passing a query
  string to Solr?
  What about the query elements which are not supported by the syntax?
  For example, BooleanQuery.setMinimumNumberShouldMatch(n) is translated by
  BooleanQuery.toString() into ~n. But this is not a valid query syntax. So
  how can we express this via query syntax in Solr?
 
  And more general question:
   Given a Lucene Query object which was built programmatically by a legacy
  code (which is using Lucene and not Solr), is there any way to translate
 it
  into Solr query (which must be a string). As Query.toString() doesn't
 have
  to be a valid Lucene query syntax, does it mean that the Solr query
 string
   must be manually translated from the Lucene query object? Is there any
  utility that performs this job? And, again, what about queries not
  supported by the query syntax, like CustomScoreQuery, PayloadTermQuery
  etc.? Are we always limited in Solr by the query parser syntax?
 
  Thanks,
  Leonid
 



Re: shards per disk

2015-01-20 Thread Roman Chyla
I think this makes sense too (i.e. the setup), since the search is getting 1K
documents each time (for textual analysis, i.e. they are probably large
docs) and uses Solr as storage (which is totally fine), so the parallel
i/o of multiple drives/shards speeds things up. The index is probably large,
so it is unrealistic to have enough RAM to cache the most used parts (if they
are hitting different docs all the time). I'm curious, as Toke points
out, what the RAID configuration was that you ran it on initially.

Best,

roman

On Tue, Jan 20, 2015 at 12:43 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Nimrod Cohen [nimrod.co...@nice.com] wrote:
  We need to get 1K documents out of 100M documents each
  time we query solr and send them to text Analysis.
   The first configuration had 8 shards on one RAID (Disk F); we
   got the 1K in around 15 seconds.
   In the second configuration we removed the RAID and worked on 8
   different disks, each shard on one disk, and get the 1K
   documents in 2-3 seconds.

 Which RAID level? 0, 1, maybe 5 or 6? If you did a RAID 0, it should be
 about the same performance as shards on individual disks, due to striping.
 If you did a RAID 1 with, for example, 2*4 disks, your performance would be
 markedly worse. If you did a RAID 1 of 8*1 disk, it would be better than
 individual drives as it would mitigate the slowest drive dictates overall
 speed problem. If your RAID is not really a RAID but instead JBOD or
 similar (http://en.wikipedia.org/wiki/Non-RAID_drive_architectures#JBOD),
 then the poor performance is to be expected as chances are all your data
 would reside on the same physical disk.

 Please describe your RAID setup in detail.

 Also, is 2-3 second response time satisfactory to you? If not, what are
 you aiming at?

 - Toke Eskildsen



New UI for SOLR-based projects

2015-01-30 Thread Roman Chyla
Hi everybody,

There exists a new open-source implementation of a search interface for
SOLR. It is written in Javascript (using Backbone), currently in version
v1.0.19 - but new features are constantly coming. Rather than describing it
in words, please see it in action for yourself at http://ui.adslabs.org -
I'd recommend exploring facets, the query form, and visualizations.

The code lives at: http://github.com/adsabs/bumblebee

Best,

  Roman


Re: New UI for SOLR-based projects

2015-01-30 Thread Roman Chyla
I gather from your comment that I should update the readme, because there could
be people who would be inclined to use the bumblebee development server in
production: "Beware those who enter through this gate!" :-)

Your point, that so far you haven't seen anybody share their middle layer
can be addressed by pointing to the following projects:

https://github.com/adsabs/solr-service
https://github.com/adsabs/adsws

These are also open source, we use them in production, and have oauth,
microservices, rest, and rate limits, we know it is not perfect, but what
is? ;-) pull requests welcome!

Thanks,

Roman
On 30 Jan 2015 21:51, Shawn Heisey apa...@elyograg.org wrote:

 On 1/30/2015 1:07 PM, Roman Chyla wrote:
  There exists a new open-source implementation of a search interface for
  SOLR. It is written in Javascript (using Backbone), currently in version
  v1.0.19 - but new features are constantly coming. Rather than describing
 it
  in words, please see it in action for yourself at http://ui.adslabs.org
 -
  I'd recommend exploring facets, the query form, and visualizations.
 
  The code lives at: http://github.com/adsabs/bumblebee

 I have no wish to trivialize the work you've done.  I haven't looked
 into the code, but a high-level glance at the documentation suggests
 that you've put a lot of work into it.

 I do however have a strong caveat for your users.  I'm the guy holding
 the big sign that says the end is near to anyone who will listen!

 By itself, this is an awesome tool for prototyping, but without some
 additional expertise and work, there are severe security implications.

 If this gets used for a public Internet facing service, the Solr server
 must be accessible from the end user's machine, which might mean that it
 must be available to the entire Internet.

 If the Solr server is not sitting behind some kind of intelligent proxy
  that can detect and deny attempts to access certain parts of the Solr
 API, then Solr will be wide open to attack.  A knowledgeable user that
 has unfiltered access to a Solr server will be able to completely delete
 the index, change any piece of information in the index, or send denial
 of service queries that will make it unable to respond to legitimate
 traffic.

 Setting up such a proxy is not a trivial task.  I know that some people
 have done it, but so far I have not seen anyone share those
 configurations.  Even with such a proxy, it might still be possible to
 easily send denial of service queries.

 I cannot find any information in your README or the documentation links
 that mentions any of these concerns.  I suspect that many who
 incorporate this client into their websites will be unaware that their
 setup may be insecure, or how to protect it.

 Thanks,
 Shawn




Re: SOLR - any open source framework

2015-01-06 Thread Roman Chyla
We compared several projects before starting - AngularJS was among them.
It is great for stuff where you can find components (already prepared),
but writing custom components was easier in other frameworks (you need to
take this statement with a grain of salt: it was specific to our situation),
and that was one year ago...

On Tue, Jan 6, 2015 at 5:20 PM, Vishal Swaroop vishal@gmail.com wrote:

 Thanks Roman... I will check it... Maybe it's off topic but how about
 Angular...
 On Jan 6, 2015 5:17 PM, Roman Chyla roman.ch...@gmail.com wrote:

  Hi Vishal, Alexandre,
 
  Here is another one, using Backbone, just released v1.0.16
 
  https://github.com/adsabs/bumblebee
 
  you can see it in action: http://ui.adslabs.org/
 
  While it primarily serves our own needs, I tried to architect it to be
  extendible (within reasonable limits of code, man power)
 
  Roman
 
  On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch 
 arafa...@gmail.com
  wrote:
 
   That's very general question. So, the following are three random ideas
   just to get you started to think of options.
  
   *) spring.io (Spring Data Solr) + Vaadin
   *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
   too)
   *) http://projectblacklight.org/
  
   Regards,
  Alex.
   
   Sign up for my Solr resources newsletter at http://www.solr-start.com/
  
  
   On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com
 wrote:
I am new to SOLR and was able to configure, run samples as well as
 able
   to
index data using DIH (from database).
   
Just wondering if there are open source framework to query and
display/visualize.
   
Regards
  
 



Re: SOLR - any open source framework

2015-01-06 Thread Roman Chyla
Hi Vishal, Alexandre,

Here is another one, using Backbone, just released v1.0.16

https://github.com/adsabs/bumblebee

you can see it in action: http://ui.adslabs.org/

While it primarily serves our own needs, I tried to architect it to be
extendible (within reasonable limits of code, man power)

Roman

On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 That's very general question. So, the following are three random ideas
 just to get you started to think of options.

 *) spring.io (Spring Data Solr) + Vaadin
 *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
 too)
 *) http://projectblacklight.org/

 Regards,
Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/


 On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
  I am new to SOLR and was able to configure, run samples as well as able
 to
  index data using DIH (from database).
 
  Just wondering if there are open source framework to query and
  display/visualize.
 
  Regards



Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
I'm not sure I understand - the autophrasing filter will allow the
parser to see all the tokens, so that they can be parsed and
multi-token synonyms identified. So if you are using the same
analyzer at query and index time, they should be able to see the same
stuff.

are you using multi-token synonyms, or just entries that look like
multi-token synonyms? (in the first case, the tokens are separated by a null
byte) - in the second case, they are just strings, even with
whitespace; your synonym file must contain exactly the same entries
as your analyzer sees them (and in the same order; or you have to use
the same analyzer to load the synonym files)

can you post the relevant part of your schema.xml?


note: I can confirm that multi-token synonym expansion can be made to
work, even in complex cases - we do it - but likely, if you need
multi-token synonyms, you will also need a smarter query parser.
sometimes your users will use query strings that contain overlapping
synonym entries, to handle that, you will have to know how to generate
all possible 'reads', example

synonym:

foo bar, foobar
hey foo, heyfoo

user input:

hey foo bar

possible readings:

((hey foo) +bar) OR (hey +(foo bar))

i'm simplifying it here, the fun starts when you are seeing a phrase query :)

On Tue, Apr 28, 2015 at 10:31 AM, Kaushik kaushika...@gmail.com wrote:
 Hi there,

 I tried the solution provided in
 https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
 .The mentioned solution works when the indexed data does not have alpha
 numerics or special characters. But in  my case the synonyms are something
 like the below.


  T-MAZ 20  POLYOXYETHYLENE (20) SORBITAN MONOLAURATE  SORBITAN
 MONODODECANOATE  POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE  POLYOXYETHYLENE
 SORBITAN MONOLAURATE  POLYSORBATE 20 [MART.]  SORBIMACROGOL LAURATE
 300  POLYSORBATE
 20 [FHFI]  FEMA NO. 2915

 They have alpha numerics, special characters, spaces, etc. Is there a way
 to implment synonyms even in such case?

 Thanks,
 Kaushik

 On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] 
 daniel.da...@nih.gov wrote:

 Handling MESH descriptor preferred terms and such is similar.   I
 encountered this during evaluation of Solr for a project here at NLM.   We
 decided to use Solr for different projects instead. I considered the
 following approaches:
  - use a custom tokenizer at index time that indexed all of the multiple
 term alternatives.
  - index the data, and then have an enrichment process that queries on
 each source synonym, and generates an update to add the target synonyms.
Follow this with an optimize.
  - During the indexing process, but before sending the data to Solr,
 process the data to tokenize and add synonyms to another field.

 Both the custom tokenizer and enrichment process share the feature that
 they use Solr's own tokenizer rather than duplicate it.   The enrichment
 process seems to me only workable in environments where you can re-index
 all data periodically, so no continuous stream of data to index that needs
 to be handled relatively quickly once it is generated.The last method
 of pre-processing the data seems the least desirable to me from a blue-sky
 perspective, but is probably the easiest to implement and the most
 independent of Solr.

 Hope this helps,

 Dan Davis, Systems/Applications Architect (Contractor),
 Office of Computer and Communications Systems,
 National Library of Medicine, NIH

 -Original Message-
 From: Kaushik [mailto:kaushika...@gmail.com]
 Sent: Monday, April 20, 2015 10:47 AM
 To: solr-user@lucene.apache.org
 Subject: Mutli term synonyms

 Hello,

 Reading up on synonyms it looks like there is no real solution for multi
 term synonyms. Is that right? I have a use case where I need to map one
 multi term phrase to another. i.e. Tween 20 needs to be translated to
 Polysorbate 40.

 Any thoughts as to how this can be achieved?

 Thanks,
 Kaushik



Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
Pls post the output of the request with debugQuery=true

Do you see the synonyms being expanded? Probably not.

You can go to the admin interface and, in the analysis section, play with the
input until you see the synonyms. Use phrase queries too. That will be
helpful to eliminate the autophrase filter
On Apr 29, 2015 6:18 AM, Kaushik kaushika...@gmail.com wrote:

 Hi Roman,

 Following is my use case:

 *Schema.xml*...

   <field name="name" type="text_autophrase" indexed="true" stored="true"/>

  <fieldType name="text_autophrase" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
              phrases="autophrases.txt" includeTokens="false"
              replaceWhitespaceWith="X" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true" />
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true" />
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true" />
    </analyzer>
  </fieldType>

  *SolrConfig.xml...*

  <requestHandler name="/autophrase" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">name</str>
      <str name="defType">autophrasingParser</str>
    </lst>
  </requestHandler>

  <queryParser name="autophrasingParser"
               class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
    <str name="phrases">autophrases.txt</str>
    <str name="replaceWhitespaceWith">X</str>
  </queryParser>


 *Synonyms.txt*
 PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,TWEEN
 20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20
 [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN
 [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ
 20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN
 MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE
 SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE
 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20
 [WHO-DD],POLYSORBATE 20 [VANDF]

 *Autophrase.txt...*

 Has all the above phrases in one column

 *Indexed document*

 doc
   field name=id31/field
   field name=namePolysorbate 20/field
   /doc

  So when I query SOLR /autophrase for tween 20 or FEMA NO. 2915, I expect to
  see the record containing Polysorbate 20. i.e.

  http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
  should have retrieved it; but it doesn't.

 What could I be doing wrong?

 On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  I'm not sure I understand - the autophrasing filter will allow the
  parser to see all the tokens, so that they can be parsed (and
  multi-token synonyms) identified. So if you are using the same
  analyzer at query and index time, they should be able to see the same
  stuff.
 
  are you using multi-token synonyms, or just entries that look like
  multi synonym? (in the first case, the tokens are separated by null
  byte) - in the second case, they are just strings even with
  whitespaces, your synonym file must contain exactly the same entries
  as your analyzer sees them (and in the same order; or you have to use
  the same analyzer to load the synonym files)
 
  can you post the relevant part of your schema.xml?
 
 
  note: I can confirm that multi-token synonym expansion can be made to
  work, even in complex cases - we do it - but likely, if you need
  multi-token synonyms, you will also need a smarter query parser.
  sometimes your users will use query strings that contain overlapping
  synonym entries, to handle that, you will have to know how to generate
  all possible 'reads', example
 
  synonym:
 
  foo bar, foobar
  hey foo, heyfoo
 
  user input:
 
  hey foo bar
 
  possible readings:
 
  ((hey foo) +bar) OR (hey +(foo bar))
 
  i'm simplifying it here, the fun starts when you are seeing a phrase
 query
  :)
 
  On Tue, Apr 28, 2015 at 10:31 AM, Kaushik kaushika...@gmail.com wrote:
   Hi there,
  
   I tried the solution provided in
  
 
 https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
   .The mentioned solution works when the indexed data does not have alpha
   numerics or special characters. But in  my case the synonyms are
  something
   like the below.
  
  
T-MAZ 20  POLYOXYETHYLENE (20) SORBITAN MONOLAURATE  SORBITAN

Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
Hi Kaushik, I meant to compare "tween 20" against "tween 20".

Your autophrase filter replaces whitespace with x, but your synonym filter
expects whitespaces. Try that.
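
Concretely, synonyms.txt then has to contain the entries in the form the
SynonymFilter actually sees after autophrasing - e.g. something like this (a
sketch based on the entries from your file, matching what your debugQuery
output shows; the exact spelling depends on where the lowercasing and the
replacement character end up in the chain):

  tweenx20,polysorbatex20,t-mazx20,femaxno.x2915,peg-20xsorbitanxlaurate,...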

Roman
On Apr 29, 2015 2:27 PM, Kaushik kaushika...@gmail.com wrote:

 Hi Roman,

  When I used the debugQuery using

  http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true
  I see the following in the response. The autophrase plugin seems to be
  doing its part. Just not the synonym expansion. When you say "use phrase
  queries", what do you mean? Please clarify.

  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  },
  "debug": {
    "rawquerystring": "tween 20",
    "querystring": "tween 20",
    "parsedquery": "name:tweenx20",
    "parsedquery_toString": "name:tweenx20",
    "explain": {},

 Thank you,

 Kaushik


 On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Pls post output of the request with debugQuery=true
 
  Do you see the synonyms being expanded? Probably not.
 
  You can go to the administer iface, in the analyzer section play with the
  input until you see the synonyms. Use phrase queries too. That will be
  helpful to elliminate autophrase filter
  On Apr 29, 2015 6:18 AM, Kaushik kaushika...@gmail.com wrote:
 
   Hi Roman,
  
   Following is my use case:
  
   *Schema.xml*...
  
  field name=name type=text_autophrase indexed=true
  stored=true/
  
   fieldType name=text_autophrase class=solr.TextField
  positionIncrementGap=100
 analyzer type=index
   tokenizer class=solr.KeywordTokenizerFactory/
   filter class=solr.LowerCaseFilterFactory /
   filter
   class=com.lucidworks.analysis.AutoPhrasingTokenFilterFactory
   phrases=autophrases.txt includeTokens=false
   replaceWhitespaceWith=X /
   filter class=solr.SynonymFilterFactory
 synonyms=synonyms.txt
   ignoreCase=true expand=true /
   filter class=solr.StopFilterFactory ignoreCase=true
   words=stopwords.txt enablePositionIncrements=true
 /
 /analyzer
 analyzer type=query
   tokenizer class=solr.KeywordTokenizerFactory/
   filter class=solr.LowerCaseFilterFactory /
   filter class=solr.SynonymFilterFactory
 synonyms=synonyms.txt
   ignoreCase=true expand=true /
   filter class=solr.StopFilterFactory ignoreCase=true
   words=stopwords.txt enablePositionIncrements=true
 /
 /analyzer
   /fieldType
  
   *SolrConfig.xml...*
  
   name=/autophrase class=solr.SearchHandler
  lst name=defaults
str name=echoParamsexplicit/str
int name=rows10/int
str name=dfname/str
str name=defTypeautophrasingParser/str
  /lst
 /requestHandler
  
 queryParser name=autophrasingParser
  
 class=com.lucidworks.analysis.AutoPhrasingQParserPlugin
  
   str name=phrasesautophrases.txt/str
   str name=replaceWhitespaceWithX/str
 /queryParser
  
  
   *Synonyms.txt*
   PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,TWEEN
   20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20
   [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN
   [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ
   20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN
   MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE
   SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE
   300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20
 [FCC],POLYSORBATE
  20
   [WHO-DD],POLYSORBATE 20 [VANDF]
  
   *Autophrase.txt...*
  
   Has all the above phrases in one column
  
   *Indexed document*
  
   doc
 field name=id31/field
 field name=namePolysorbate 20/field
 /doc
  
   So when I query SOLR /autphrase for tween 20 or FEMA NO. 2915, I expect
  to
   see the record containig Polysorbate 20. i.e.
  
  
 
 http://localhost:8983/solr/collection1/autophrase?q=tween+20wt=jsonindent=true
   should have retrieved it; but it doesnt.
  
   What could I be doing wrong?
  
   On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
I'm not sure I understand - the autophrasing filter will allow the
parser to see all the tokens, so that they can be parsed (and
multi-token synonyms) identified. So if you are using the same
analyzer at query and index time, they should be able to see the same
stuff.
   
are you using multi-token synonyms, or just entries that look like
multi synonym? (in the first case, the tokens are separated by null
byte) - in the second case, they are just strings even with
whitespaces, your synonym file must contain exactly the same entries
as your analyzer sees them (and in the same order; or you have to use
the same analyzer to load the synonym files)
   
can you post the relevant part of your schema.xml?
   
   
note

Re: Injecting synonymns into Solr

2015-05-04 Thread Roman Chyla
It shouldn't matter.  Btw, try a URL instead of a file path. I think the
underlying loading mechanism uses java File, it could work.
On May 4, 2015 2:07 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

 Would like to check, will this method of splitting the synonyms into
 multiple files use up a lot of memory?

 I'm trying it with about 10 files and that collection is not able to be
 loaded due to insufficient memory.

 Although currently my machine only have 4GB of memory, but I only have
 500,000 records indexed, so not sure if there's a significant impact in the
 future (even with larger memory) when my index grows and other things like
 faceting, highlighting, and carrot tools are implemented.

 Regards,
 Edwin



 On 1 May 2015 at 11:08, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

  Thank you for the info. Yup this works. I found out that we can't load
  files that are more than 1MB into zookeeper, as it happens to any files
  that's larger than 1MB in size, not just the synonyms files.
  But I'm not sure if there will be an impact to the system, as the number
  of synonym text file can potentially grow up to more than 20 since my
  sample synonym file size is more than 20MB.
 
  Currently I only have less than 500,000 records indexed in Solr, so not
  sure if there will be a significant impact as compared to one which has
  millions of records.
  Will try to get more records indexed and will update here again.
 
  Regards,
  Edwin
 
 
  On 1 May 2015 at 08:17, Philippe Soares soa...@genomequest.com wrote:
 
  Split your synonyms into multiple files and set the SynonymFilterFactory
  with a coma-separated list of files. e.g. :
  synonyms=syn1.txt,syn2.txt,syn3.txt
 
  On Thu, Apr 30, 2015 at 8:07 PM, Zheng Lin Edwin Yeo 
  edwinye...@gmail.com
  wrote:
 
   Just to populate it with the general synonym words. I've managed to
   populate it with some source online, but is there a limit to what it
 can
   contains?
  
   I can't load the configuration into zookeeper if the synonyms.txt file
   contains more than 2100 lines.
  
   Regards,
   Edwin
   On 1 May 2015 05:44, Chris Hostetter hossman_luc...@fucit.org
  wrote:
  
   
: There is a possible solution here:
: https://issues.apache.org/jira/browse/LUCENE-2347 (Dump WordNet
 to
   SOLR
: Synonym format).
   
 If you have WordNet synonyms you don't need any special code/tools to
convert them -- the current solr.SynonymFilterFactory supports
 wordnet
files (just specify format=wordnet)
   
   
:   Does anyone knows any faster method of populating the
  synonyms.txt
file
:   instead of manually typing in the words into the file, which
  there
could
:  be
:   thousands of synonyms around?
   
 populate from what?  what is the source of your data?
   
the default solr synonym file format is about as simple as it could
possibly be -- pretty trivial to generate it from scripts -- the
 hard
   part
is usually selecting the synonym data you want to use and parsing
   whatever
format it is already in.
   
   
   
-Hoss
http://www.lucidworks.com/
   
  
 
 
 



Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
Brackets are range operators for the parser, so you need to escape them (\[,
\]) or enclose the value in quotes.
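
For instance, at the query-syntax level either of these forms should get past
the parser (just an illustration):

  polysORbate 20\[mart.\]
  "polysORbate 20[mart.]"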
 On Apr 29, 2015 10:27 PM, Kaushik kaushika...@gmail.com wrote:

 Hi Roman,

 Tween 20 also did not retrieve any results. So I replaced the whitespaces
 in the synonyms.txt with 'x' and now when I search, I get the results back.
 One problem, however, still exists: when I search for POLYSORBATE
 20[MART.], which is a synonym for POLYSORBATE 20, I get the error below,

 msg: org.apache.solr.search.SyntaxError: Cannot parse 'polysORbate
 20[mart.] ': Encountered \ \]\ \] \\ at line 1, column
 20.\r\nWas expecting one of:\r\n\TO\ ...\r\nRANGE_QUOTED
 ...\r\nRANGE_GOOP ...\r\n,
 code: 400

 If I am able to solve this, I think I am pretty close to the solution.
 Any thoughts there?

 I appreciate your help on this matter.

 Thank you,

 Kaushik



 On Wed, Apr 29, 2015 at 5:48 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi Kaushik, I meant to compare tween 20 against tween 20.
 
  Your autophrase filter replaces whitespace with x, but your synonym
 filter
  expects whitespaces. Try that.
 
  Roman
  On Apr 29, 2015 2:27 PM, Kaushik kaushika...@gmail.com wrote:
 
   Hi Roman,
  
   When I used the debugQuery using
  
  
 
 http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true
   I see the following in the response. The autophrase plugin seems to be
   doing its part. Just not the synonym expansion. When you say use phrase
   queries, what do you mean? Please clarify.
  
    "response": {
      "numFound": 0,
      "start": 0,
      "docs": []
    },
    "debug": {
      "rawquerystring": "tween 20",
      "querystring": "tween 20",
      "parsedquery": "name:tweenx20",
      "parsedquery_toString": "name:tweenx20",
      "explain": {},
  
   Thank you,
  
   Kaushik
  
  
   On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Pls post output of the request with debugQuery=true
   
Do you see the synonyms being expanded? Probably not.
   
You can go to the admin interface; in the analyzer section, play with the
input until you see the synonyms. Use phrase queries too. That will be
helpful to eliminate the autophrase filter.
On Apr 29, 2015 6:18 AM, Kaushik kaushika...@gmail.com wrote:
   
 Hi Roman,

 Following is my use case:

 *Schema.xml*...

 <field name="name" type="text_autophrase" indexed="true" stored="true"/>

 <fieldType name="text_autophrase" class="solr.TextField"
            positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
             phrases="autophrases.txt" includeTokens="false"
             replaceWhitespaceWith="X"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
   </analyzer>
 </fieldType>

 *SolrConfig.xml...*

 <requestHandler name="/autophrase" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <int name="rows">10</int>
     <str name="df">name</str>
     <str name="defType">autophrasingParser</str>
   </lst>
 </requestHandler>

 <queryParser name="autophrasingParser"
              class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
   <str name="phrases">autophrases.txt</str>
   <str name="replaceWhitespaceWith">X</str>
 </queryParser>


 *Synonyms.txt*
 PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN
 MONOLAURATE,TWEEN
 20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20
 [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN
 [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ
 20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN
 MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE
 SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE
 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20
   [FCC],POLYSORBATE
20
 [WHO-DD],POLYSORBATE 20 [VANDF]

 *Autophrase.txt...*

 Has all the above phrases in one column

 *Indexed document*

 <doc>
   <field name="id">31</field>

Re: How to use BitDocSet within a PostFilter

2015-08-03 Thread Roman Chyla
Hi,
inStockSkusBitSet.get(currentChildDocNumber)

Is that child a lucene id? If yes, does it include offset? Every index
segment starts at a different point, but docs are numbered from zero. So to
check them against the full index bitset, I'd be doing
Bitset.exists(indexBase + docid)

Just one thing to check

Roman
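
[Editorial aside: a hedged sketch of the offset handling Roman describes, inside a DelegatingCollector. Doc ids passed to collect() are segment-local, while the FixedBitSet built from SolrIndexSearcher.getDocSet() is numbered against the whole index, so the segment's docBase must be added before testing the bits. Names such as inStockSkusBitSet mirror this thread and are assumptions; the grandchild-walking logic from the original post is left out, only the rebase check is shown.]

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.FixedBitSet;
import org.apache.solr.search.DelegatingCollector;

public class InStockCollector extends DelegatingCollector {
  private final FixedBitSet inStockSkusBitSet; // index-wide bits from getDocSet()
  private int segmentBase;                     // offset of the current segment

  public InStockCollector(FixedBitSet inStockSkusBitSet) {
    this.inStockSkusBitSet = inStockSkusBitSet;
  }

  @Override
  public void doSetNextReader(LeafReaderContext context) throws IOException {
    super.doSetNextReader(context);   // keeps the delegate collector wired up
    segmentBase = context.docBase;
  }

  @Override
  public void collect(int doc) throws IOException {
    // 'doc' is relative to the segment; rebase it before testing the global bitset
    if (inStockSkusBitSet.get(segmentBase + doc)) {
      super.collect(doc);             // Solr sets the next collector via setLastDelegate()
    }
  }
}
```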
On Aug 3, 2015 1:24 AM, Stephen Weiss steve.we...@wgsn.com wrote:

 Hi everyone,

 I'm trying to write a PostFilter for Solr 5.1.0, which is meant to crawl
 through grandchild documents during a search through the parents and filter
 out documents based on statistics gathered from aggregating the
 grandchildren together.  I've been successful in getting the logic correct,
 but it does not perform so well - I'm grabbing too many documents from the
 index along the way.  I'm trying to filter out grandchild documents which
 are not relevant to the statistics I'm collecting, in order to reduce the
 number of document objects pulled from the IndexReader.

 I've implemented the following code in my DelegatingCollector.collect:

 if (inStockSkusBitSet == null) {
 SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS; // type cast from
 IndexSearcher to expose getDocSet.
 inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
 inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet; // type cast from
 DocSet to expose getBits.
 inStockSkusBitSet = inStockSkusBitDocSet.getBits();
 }


 My BitDocSet reports a size which matches a standard query for the more
 limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) also
 reports this same cardinality.  Based on that fact, it seems that the
 getDocSet call itself must be working properly, and returning the right
 number of documents.  However, when I try to filter out grandchild
 documents using either BitDocSet.exists or BitSet.get (passing over any
 grandchild document which doesn't exist in the bitdocset or return true
 from the bitset), I get about 1/3 fewer results than I'm supposed to.   It
 seems many documents that should match the filter are being excluded, and
 documents which should not match the filter are being included.

 I'm trying to use it either of these ways:

 if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
 if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

 The currentChildDocNumber is simply the docNumber which is passed to
 DelegatingCollector.collect, decremented until I hit a document that
 doesn't belong to the parent document.

 I can't seem to figure out a way to actually use the BitDocSet (or its
 derivatives) to quickly eliminate document IDs.  It seems like this is how
 it's supposed to be used.  What am I getting wrong?

 Sorry if this is a newbie question, I've never written a PostFilter
 before, and frankly, the documentation out there is a little sketchy
 (mostly for version 4) - so many classes have changed names and so many of
 the more well-documented techniques are deprecated or removed now, it's
 tough to follow what the current best practice actually is.  I'm using the
 block join functionality heavily so I'm trying to keep more current than
 that.  I would be happy to send along the full source privately if it would
 help figure this out, and plan to write up some more elaborate instructions
 (updated for Solr 5) for the next person who decides to write a PostFilter
 and work with block joins, if I ever manage to get this performing well
 enough.

 Thanks for any pointers!  Totally open to doing this an entirely different
 way.  I read DocValues might be a more elegant approach but currently that
 would require reindexing, so trying to avoid that.

 Also, I've been wondering if the query above would read from the filter
 cache or not.  The query is constructed like this:


 private Term inStockTrueTerm = new Term(sku_history.is_in_stock,
 T);
 private Term objectTypeSkuHistoryTerm = new Term(object_type,
 sku_history);
 ...

 inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
 objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
 inStockSkusQuery = new BooleanQuery();
 inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
 inStockSkusQuery.add(objectTypeSkuHistoryTermQuery,
 BooleanClause.Occur.MUST);
 --
 Steve

 


Re: Forking Solr

2015-10-17 Thread Roman Chyla
I've taken the route of extending solr, the repo checks out solr and builds
on top of that. The hard part was to figure out how to use solr test
classes and the default location for integration tests, but once there, it
is relatively easy. Google for montysolr, the repo is on github.
Roman
On Oct 16, 2015 10:52 PM, "Upayavira"  wrote:

>
>
> On Fri, Oct 16, 2015, at 04:00 PM, Ryan Josal wrote:
> > Thanks for the feedback, forking lucene/solr is my last resort indeed.
> >
> > 1) It's not about creating fresh new plugins.  It's about modifying
> > existing ones or core solr code.
> > 2) I want to submit the patch to modify core solr or lucene code, but I
> > also want to run it in prod before its accepted and released publicly.
> > Also I think this helps solidify the patch over time.
> > 3) I have to do this all the time, and I agree it's better than forking,
> > but doing this repeatedly over time has diminishing returns because it
> > increases the cost of upgrading solr.  I also requires some ugly
> > reflection
> > in most cases, and in others copying verbatim a pile of other classes.
>
> If you want to patch a component, change its package name and fork that
> component. I have a custom MoreLikeThisHandler in production quite
> happily like this.
>
> I've also done an SVN checkout of Solr, made my code changes there, and
> then created a local git repo that I can track my own changes for stuff
> that will eventually get pushed back to Solr.
>
> I work concurrently on a number of patches to the Admin UI. They tend to
> sit in different JIRAs as patches for a few days before I commit them,
> so this local git repo makes it much easier for me to track my changes,
> but from the Solr community's perspective, I'm just using SVN.
>
> I could easily push this git repo up to github or such if I thought that
> added value.
>
> Then, I regularly run svn update which keeps this checkout up-to-date,
> and confirm it hasn't broken things.
>
> If you wanted to run against a specific version in Solr, you could force
> SVN to a specific revision (e.g. of the 5x branch) - the one that was
> released, and git merge your patches into it, etc, etc, etc.
>
> Upayavira
>


Re: Scramble data

2015-10-08 Thread Roman Chyla
Or you could also apply XSL to returned records:
https://wiki.apache.org/solr/XsltResponseWriter


On Thu, Oct 8, 2015 at 5:06 PM, Uwe Reh  wrote:
> Hi,
>
> my suggestions are probably too simple, because they are not a real
> protection of privacy. But maybe one fits to your needs.
>
> Most simple:
> Declare your 'hidden' fields just as "indexed=true stored=false", the data
> will be used for searching, but the fields are not listed in the query
> response.
> Cons: The Terms of the fields can be still examined by advanced users. As
> example they could use the field as facet.
>
> Very simple
> Use a PhoneticFilter for indexing and searching. The encoding
> "ColognePhonetic" generates a numeric hash for each term. The name
> "Breschnew" will be saved as "17863".
> Cons: Phonetic similarities will lead to false hits. This hashing is really
> only scrambling and not appropriate as a security feature.
>
> Simple
> Declare a special SearchHandlers in your solrconfig.xml and define an
> invariant fieldList parameter. This should contain just the public subset of
> your fields.
> Cons: I'm not really sure, about this.
>
> Still quite simple
> Write your own Filter, which generates real cryptographic hashes
> Cons: If the entropy of your data is poor, you may need additional tricks
> like padding the data. This filter may slow down your system.
>
>
> Last but not least, be aware that searching could be a way to restore
> hidden information. If a query for "billionaire" gets just one hit, it's
> obvious that "billionaire" is an attribute of the document even if it is not
> listed in the result.
>
> Uwe
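
[Editorial aside: a minimal sketch of the "write your own Filter" option quoted above, assuming a SHA-256 digest of each term; the class name is made up and it would still need a TokenFilterFactory wrapper to be usable from schema.xml.]

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class HashingTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final MessageDigest digest;

  public HashingTokenFilter(TokenStream input) {
    super(input);
    try {
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // replace the term with the hex form of its SHA-256 hash
    byte[] hash = digest.digest(termAtt.toString().getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder(hash.length * 2);
    for (byte b : hash) {
      hex.append(String.format("%02x", b));
    }
    termAtt.setEmpty().append(hex);
    return true;
  }
}
```

The same filter would have to run in both the index and query analyzer chains, and, as Uwe notes, low-entropy values would still need padding or salting to resist guessing.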


Re: Reverse query?

2015-10-02 Thread Roman Chyla
I'd like to offer another option:

you say you want to match a long query against a document - but maybe you
won't know whether to pick "Mad Max" or "Max is" (not to mention the
performance hit of a "*mad max*" search - or is that not the case
anymore?). Take a look at the NGram tokenizer (say a size of 2, or
bigger). What it does is split the input into overlapping segments
of 'X' words (words, not characters - although characters work too,
just pick a bigger N):

mad max
max 1979
1979 australian

i'd recommend placing stopfilter before the ngram

 - then for the long query string of "Hey Mad Max is 1979" you
would search "hey mad" OR "mad max" OR "max 1979"... (perhaps the query
tokenizer could be convinced to do the search for you automatically). And
voila, the more overlapping segments there are, the higher the document
will score.

hth,

roman
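
[Editorial aside: in Lucene/Solr, word-level overlapping segments like the ones above are called shingles (solr.ShingleFilterFactory) rather than character ngrams; a standalone Lucene 5.x-style sketch of the chain Roman describes, with the stop filter ahead of the shingling. Names here are illustrative.]

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public class OverlappingSegmentsAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream stream = new LowerCaseFilter(source);
    // stop filter before the shingles, as suggested above
    stream = new StopFilter(stream, EnglishAnalyzer.getDefaultStopSet());
    // two-word overlapping segments: "mad max", "max 1979", "1979 australian", ...
    ShingleFilter shingles = new ShingleFilter(stream, 2, 2);
    shingles.setOutputUnigrams(false);
    return new TokenStreamComponents(source, shingles);
  }
}
```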



On Fri, Oct 2, 2015 at 12:03 PM, Erick Erickson  wrote:
> The admin/analysis page is your friend here, find it and use it ;)
> Note you have to select a core on the admin UI screen before you can
> see the choice.
>
> Because apart from the other comments, KeywordTokenizer is a red flag.
> It does NOT break anything up into tokens, so if your doc contains:
> Mad Max is a 1979 Australian
> as the whole field, the _only_ match you'll ever get is if you search exactly
> "Mad Max is a 1979 Australian"
> Not Mad, not mad, not Max, exactly all 6 words separated by exactly one space.
>
> Andrea's suggestion is the one you want, but be sure you use one of
> the tokenizing analysis chains, perhaps start with text_en (in the
> stock distro). Be sure to completely remove your node/data directory
> (as in rm -rf data) after you make the change.
>
> And really, explore the admin/analysis page; it's where a LOT of these
> kinds of problems find solutions ;)
>
> Best,
> Erick
>
> On Fri, Oct 2, 2015 at 7:57 AM, Ravi Solr  wrote:
>> Hello Remi,
>> Iam assuming the field where you store the data is analyzed.
>> The field definition might help us answer your question better. If you are
>> using edismax handler for your search requests, I believe you can achieve
>> you goal by setting set your "mm" to 100%, phrase slop "ps" and query slop
>> "qs" parameters to zero. I think that will force exact matches.
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Fri, Oct 2, 2015 at 9:48 AM, Andrea Roggerone <
>> andrearoggerone.o...@gmail.com> wrote:
>>
>>> Hi Remy,
>>> The question is not really clear, could you explain a little bit better
>>> what you need? Reading your email I understand that you want to get
>>> documents containing all the search terms typed. For instance if you search
>>> for "Mad Max", you wanna get documents containing both Mad and Max. If
>>> that's your need, you can use a phrase query like:
>>>
>>> *"*Mad Max*"~2*
>>>
>>> where enclosing your keywords between double quotes means that you want to
>>> get both Mad and Max and the optional parameter ~2 is an example of *slop*.
>>> If you need more info you can look for *Phrase Query* in
>>> https://wiki.apache.org/solr/SolrRelevancyFAQ
>>>
>>> On Fri, Oct 2, 2015 at 2:33 PM, remi tassing 
>>> wrote:
>>>
>>> > Hi,
>>> > I have medium-low experience on Solr and I have a question I couldn't
>>> quite
>>> > solve yet.
>>> >
>>> > Typically we have quite short query strings (a couple of words) and the
>>> > search is done through a set of bigger documents. What if the logic is
>>> > turned a little bit around. I have a document and I need to find out what
>>> > strings appear in the document. A string here could be a person name
>>> > (including space for example) or a location...which are indexed in Solr.
>>> >
>>> > A concrete example, we take this text from wikipedia (Mad Max):
>>> > "*Mad Max is a 1979 Australian dystopian action film directed by George
>>> > Miller .
>>> > Written by Miller and James McCausland from a story by Miller and
>>> producer
>>> > Byron Kennedy , it tells a
>>> > story of societal breakdown
>>> > , murder, and vengeance
>>> > . The film, starring the
>>> > then-little-known Mel Gibson ,
>>> > was released internationally in 1980. It became a top-grossing Australian
>>> > film, while holding the record in the Guinness Book of Records
>>> >  for decades as
>>> > the
>>> > most profitable film ever created,[1]
>>> >  and
>>> > has
>>> > been credited for further opening the global market to Australian New
>>> Wave
>>> >  films.*
>>> > 
>>> > 

Jetty refuses connections

2016-05-16 Thread Roman Chyla
Hi,

I'm hoping someone has seen/encountered a similar problem. We have
solr instances with all Jetty threads in BLOCKED state. The
application does not respond to any http requests.

It is SOLR 4.9 running inside docker on Amazon EC2. Jetty is 8.1 and
there is an nginx proxy in front of it (with persistent connections).
The machine has plenty of RAM, the load is minimal - it seems, that
idling triggers the condition. It is a slave replicating from master
without issues, no OOM errors and actually no other errors at all...

After some time, pretty much all threads will have this:

```
Thread 186: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame;
information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object,
long) @bci=20, line=215 (Compiled frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long)
@bci=78, line=2078 (Compiled frame)
 - java.util.concurrent.ArrayBlockingQueue.poll(long,
java.util.concurrent.TimeUnit) @bci=49, line=418 (Compiled frame)
 - org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll()
@bci=12, line=526 (Compiled frame)
 - 
org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(org.eclipse.jetty.util.thread.QueuedThreadPool)
@bci=1, line=44 (Compiled frame)
 - org.eclipse.jetty.util.thread.QueuedThreadPool$3.run() @bci=275,
line=572 (Compiled frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
```

there is only one thread relevant to solr - waiting on
`org.apache.solr.core.CloserThread.run()` - which I believe is
expected

I have experimented with jetty.server.nio.SelectChannelConnector and
the default bio.Selector. The ArrayBlockingQueue has a limit of 6000.
Next I'll try to run the Jetty QueuedThreadPool with 100 (the default is -1).

...but it takes many days to trigger...maybe some of you already know
the solution?

Thanks!

  Roman


Re: The most efficient way to get un-inverted view of the index?

2016-08-17 Thread Roman Chyla
Joel, thanks, but which of them? I've counted at least 4, if not more,
different ways to get DocValues. Are there many functionally
equal approaches just because devs can't agree on using one api? Or is
there a deeper reason?

Btw, the FieldCache is still there - both in lucene (to be deprecated)
and in solr, but it became package-accessible only.

This is what removed the FieldCache:
https://issues.apache.org/jira/browse/LUCENE-5666
This is what followed: https://issues.apache.org/jira/browse/SOLR-8096

And there is still code which un-inverts data from an index if no
doc-values are available.

--roman
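
[Editorial aside: a minimal Lucene 6.x-style sketch of the org.apache.lucene.index.DocValues route Joel suggests below; "year" is a hypothetical single-valued numeric docValues field and the random-access get() call is the pre-7.0 API.]

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;

public class ReadDocValues {
  public static void dump(IndexReader reader) throws IOException {
    for (LeafReaderContext leaf : reader.leaves()) {
      // returns an "empty" instance rather than null if the field has no docvalues
      NumericDocValues dv = DocValues.getNumeric(leaf.reader(), "year");
      for (int docId = 0; docId < leaf.reader().maxDoc(); docId++) {
        long value = dv.get(docId);                       // Lucene 6.x random access
        System.out.println((leaf.docBase + docId) + " -> " + value);
      }
    }
  }
}
```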

On Tue, Aug 16, 2016 at 9:54 PM, Joel Bernstein <joels...@gmail.com> wrote:
> You'll want to use org.apache.lucene.index.DocValues. The DocValues api has
> replaced the field cache.
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 8:18 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
>
>> I need to read data from the index in order to build a special cache.
>> Previously, in SOLR4, this was accomplished with FieldCache or
>> DocTermOrds
>>
>> Now, I'm struggling to see what API to use, there is many of them:
>>
>> on lucene level:
>>
>> UninvertingReader.getNumericDocValues (and others)
>> .getNumericValues()
>> MultiDocValues.getNumericValues()
>> MultiFields.getTerms()
>>
>> on solr level:
>>
>> reader.getNumericValues()
>> UninvertingReader.getNumericDocValues()
>> and extensions to FilterLeafReader - eg. very intersting, but
>> undocumented facet accumulators (ex: NumericAcc)
>>
>>
>> I need this for solr, and ideally re-use the existing cache [ie. the
>> special cache is using another fields so those get loaded only once
>> and reused in the old solr; which is a win-win situation]
>>
>> If I use reader.getValues() or FilterLeafReader will I be reading data
>> every time the object is created? What would be the best way to read
>> data only once?
>>
>> Thanks,
>>
>> --roman
>>


The most efficient way to get un-inverted view of the index?

2016-08-16 Thread Roman Chyla
I need to read data from the index in order to build a special cache.
Previously, in SOLR4, this was accomplished with FieldCache or
DocTermOrds

Now, I'm struggling to see which API to use; there are many of them:

on lucene level:

UninvertingReader.getNumericDocValues (and others)
.getNumericValues()
MultiDocValues.getNumericValues()
MultiFields.getTerms()

on solr level:

reader.getNumericValues()
UninvertingReader.getNumericDocValues()
and extensions to FilterLeafReader - e.g. very interesting, but
undocumented facet accumulators (ex: NumericAcc)


I need this for solr, and ideally re-use the existing cache [i.e. the
special cache uses other fields, so those get loaded only once
and reused in the old solr; which is a win-win situation]

If I use reader.getValues() or FilterLeafReader, will I be reading data
every time the object is created? What would be the best way to read
data only once?

Thanks,

--roman


Re: The most efficient way to get un-inverted view of the index?

2016-08-17 Thread Roman Chyla
in case this helps someone, here is a solution (probably very
efficient already, but i didn't profile it); it can deal with DocValues and
with FieldCache (the old 'stored' values)



  private void unInvertedTheDamnThing(
      SolrIndexSearcher searcher,
      List<String> fields,
      KVSetter setter) throws IOException {

    // KVSetter and Transformer are helper types from the surrounding (not shown) code
    LeafReader reader = searcher.getLeafReader();
    IndexSchema schema = searcher.getCore().getLatestSchema();
    List<LeafReaderContext> leaves = reader.getContext().leaves();

    Bits liveDocs;
    LeafReader lr;
    Transformer transformer;

    for (LeafReaderContext leave : leaves) {
      int docBase = leave.docBase;
      liveDocs = leave.reader().getLiveDocs();
      lr = leave.reader();
      FieldInfos fInfo = lr.getFieldInfos();

      for (String field : fields) {

        FieldInfo fi = fInfo.fieldInfo(field);
        SchemaField fSchema = schema.getField(field);
        DocValuesType fType = fi.getDocValuesType();
        Map<String, Type> mapping = new HashMap<String, Type>();
        final LeafReader unReader;

        if (fType == DocValuesType.NONE) {
          // no docvalues: decide how to un-invert based on the schema field type
          Class<?> c = fSchema.getType().getClass();
          if (TextField.class.isAssignableFrom(c) || StrField.class.isAssignableFrom(c)) {
            if (fSchema.multiValued()) {
              mapping.put(field, Type.SORTED);
            } else {
              mapping.put(field, Type.BINARY);
            }
          } else if (TrieIntField.class.isAssignableFrom(c)) {
            if (fSchema.multiValued()) {
              mapping.put(field, Type.SORTED_SET_INTEGER);
            } else {
              mapping.put(field, Type.INTEGER_POINT);
            }
          } else {
            continue;
          }
          unReader = new UninvertingReader(lr, mapping);
          // the wrapping reader reports the docvalues type it synthesizes for the field
          fType = unReader.getFieldInfos().fieldInfo(field).getDocValuesType();
        } else {
          unReader = lr;
        }

        switch (fType) {
          case NUMERIC:
            transformer = new Transformer() {
              NumericDocValues dv = unReader.getNumericDocValues(field);
              @Override
              public void process(int docBase, int docId) {
                int v = (int) dv.get(docId);
                setter.set(docBase, docId, v);
              }
            };
            break;
          case SORTED_NUMERIC:
            transformer = new Transformer() {
              SortedNumericDocValues dv = unReader.getSortedNumericDocValues(field);
              @Override
              public void process(int docBase, int docId) {
                dv.setDocument(docId);
                int max = dv.count();
                int v;
                for (int i = 0; i < max; i++) {
                  v = (int) dv.valueAt(i);
                  setter.set(docBase, docId, v);
                }
              }
            };
            break;
          case SORTED_SET:
            transformer = new Transformer() {
              SortedSetDocValues dv = unReader.getSortedSetDocValues(field);
              @Override
              public void process(int docBase, int docId) {
                dv.setDocument(docId);
                for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
                  final BytesRef value = dv.lookupOrd(ord);
                  setter.set(docBase, docId, value.utf8ToString());
                }
              }
            };
            break;
          case SORTED:
            transformer = new Transformer() {
              SortedDocValues dv = unReader.getSortedDocValues(field);
              @Override
              public void process(int docBase, int docId) {
                BytesRef v = dv.get(docId);
                if (v.length == 0)
                  return;
                setter.set(docBase, docId, v.utf8ToString());
              }
            };
            break;
          case BINARY:
            transformer = new Transformer() {
              BinaryDocValues dv = unReader.getBinaryDocValues(field);
              @Override
              public void process(int docBase, int docId) {
                BytesRef v = dv.get(docId);
                if (v.length == 0)
                  return;
                setter.set(docBase, docId, v.utf8ToString());
              }
            };
            break;
          default:
            throw new IllegalArgumentException("The field " + field
                + " is of a type that cannot be un-inverted");
        }

        // walk the live documents of this segment and hand every value to the setter
        int i = 0;
        while (i < lr.maxDoc()) {
          if (liveDocs != null && !(i < liveDocs.length() && liveDocs.get(i))) {
            i++;
            continue;
          }
          transformer.process(docBase, i);
          i++;
        }
      }
    }
  }

On Wed, Aug 17, 2016 at 1:22 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> Joel, thanks, but which of them? I've counted at least 4, if not more,
> different ways of how to get DocValues. Are there many functionally
> equal approaches just because devs can't agree on using one api? Or is
> there a deeper reason?
>
> Btw, the FieldCache is still there - both in lucene (to be deprecated)
> and in solr; but became package accessible only
>
> This is what removed the FieldCache:
> https://issues.apache.org/jira/browse/LUCENE-5666
> This is what followed: https://issues.apache.org/jira/browse/SOLR-8096
>
> And there is still code which un-inverts data from an index

storing large text fields in a database? (instead of inside index)

2018-02-20 Thread Roman Chyla
Hello,

We have a use case of a very large index (slave-master; for unrelated
reasons the search cannot work in the cloud mode) - one of the fields is a
very large text, stored mostly for highlighting. To cut down the index size
(for purposes of replication/scaling) I thought I could try to save it in a
database - and not in the index.

Lucene has codecs - one of the methods is for 'stored fields', so that seems
like a natural path for me.

However, I'd expect somebody else has had a similar problem before. I googled
and couldn't find any solutions. Using the codecs seems like a really good fit
for this particular problem - am I missing something? Is there a better way
to cut down on index size? (besides solr cloud/sharding, compression)

Thank you,

   Roman
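
[Editorial aside: a hedged sketch of the codec route mentioned above, assuming a Lucene 7.x default codec; it is not a working database-backed store, it only shows the hook point. A real implementation would return a custom StoredFieldsFormat that writes the large field to the database and keeps only a key in the index.]

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene70.Lucene70Codec;

public class DbBackedCodec extends FilterCodec {

  public DbBackedCodec() {
    // delegate everything else to the default codec of the Lucene version in use
    super("DbBackedCodec", new Lucene70Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    // Hook point: return a custom StoredFieldsFormat here. This sketch just
    // returns the delegate's format to show where the override lives.
    return delegate.storedFieldsFormat();
  }
}
```

Such a codec would still have to be registered through Lucene's SPI (META-INF/services/org.apache.lucene.codecs.Codec) and selected on the Solr side via a custom codecFactory in solrconfig.xml, and each document would need a stored key so the external text can be fetched back for highlighting.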


Re: storing large text fields in a database? (instead of inside index)

2018-02-21 Thread Roman Chyla
Hi and thanks, Emir! FieldType might indeed be another layer where the
logic could live.

On Wed, Feb 21, 2018 at 6:32 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi,
> Maybe you could use external field type as an example how to hook up
> values from DB: https://lucene.apache.org/solr/guide/6_6/working-with-
> external-files-and-processes.html <https://lucene.apache.org/
> solr/guide/6_6/working-with-external-files-and-processes.html>
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 20 Feb 2018, at 20:39, Roman Chyla <roman.ch...@gmail.com> wrote:
> >
> > Say there is a high load and  I'd like to bring a new machine and let it
> > replicate the index, if 100gb and more can be shaved, it will have a
> > significant impact on how quickly the new searcher is ready and added to
> > the cluster. Impact on the search speed is likely minimal.
> >
> > we are investigating the idea of two clusters but i have to say it seems
> to
> > me more complex than storing/loading a field from an external source.
> > having said that, I wonder why this was not done before (maybe it was)
> and
> > what the cons are (besides the obvious ones: maintenance and the database
> > being potential point of failure; well in that case i'd miss highlights -
> > can live with that...)
> >
> > On Tue, Feb 20, 2018 at 10:36 AM, David Hastings <
> > hastings.recurs...@gmail.com> wrote:
> >
> >> Really depends on what you consider too large, and why the size is a big
> >> issue, since most replication will go at about 100mg/second give or
> take,
> >> and replicating a 300GB index is only an hour or two.  What i do for
> this
> >> purpose is store my text in a separate index altogether, and call on
> that
> >> core for highlighting.  So for my use case, the primary index with no
> >> stored text is around 300GB and replicates as needed, and the full text
> >> indexes with stored text totals around 500GB and are replicating non
> stop.
> >> All searching goes against the primary index, and for highlighting i
> call
> >> on the full text indexes that have a stupid simple schema.  This has
> worked
> >> for me pretty well at least.
> >>
> >> On Tue, Feb 20, 2018 at 10:27 AM, Roman Chyla <roman.ch...@gmail.com>
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> We have a use case of a very large index (slave-master; for unrelated
> >>> reasons the search cannot work in the cloud mode) - one of the fields
> is
> >> a
> >>> very large text, stored mostly for highlighting. To cut down the index
> >> size
> >>> (for purposes of replication/scaling) I thought I could try to save it
> >> in a
> >>> database - and not in the index.
> >>>
> >>> Lucene has codecs - one of the methods is for 'stored field', so that
> >> seems
> >>> likes a natural path for me.
> >>>
> >>> However, I'd expect somebody else before had a similar problem. I
> googled
> >>> and couldn't find any solutions. Using the codecs seems really good
> thing
> >>> for this particular problem, am I missing something? Is there a better
> >> way
> >>> to cut down on index size? (besides solr cloud/sharding, compression)
> >>>
> >>> Thank you,
> >>>
> >>>   Roman
> >>>
> >>
>
>


Re: storing large text fields in a database? (instead of inside index)

2018-02-20 Thread Roman Chyla
Say there is a high load and I'd like to bring up a new machine and let it
replicate the index; if 100GB or more can be shaved off, it will have a
significant impact on how quickly the new searcher is ready and added to
the cluster. Impact on the search speed is likely minimal.

we are investigating the idea of two clusters but i have to say it seems to
me more complex than storing/loading a field from an external source.
having said that, I wonder why this was not done before (maybe it was) and
what the cons are (besides the obvious ones: maintenance and the database
being potential point of failure; well in that case i'd miss highlights -
can live with that...)

On Tue, Feb 20, 2018 at 10:36 AM, David Hastings <
hastings.recurs...@gmail.com> wrote:

> Really depends on what you consider too large, and why the size is a big
> issue, since most replication will go at about 100MB/second give or take,
> and replicating a 300GB index is only an hour or two.  What i do for this
> purpose is store my text in a separate index altogether, and call on that
> core for highlighting.  So for my use case, the primary index with no
> stored text is around 300GB and replicates as needed, and the full text
> indexes with stored text totals around 500GB and are replicating non stop.
> All searching goes against the primary index, and for highlighting i call
> on the full text indexes that have a stupid simple schema.  This has worked
> for me pretty well at least.
>
> On Tue, Feb 20, 2018 at 10:27 AM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
>
> > Hello,
> >
> > We have a use case of a very large index (slave-master; for unrelated
> > reasons the search cannot work in the cloud mode) - one of the fields is
> a
> > very large text, stored mostly for highlighting. To cut down the index
> size
> > (for purposes of replication/scaling) I thought I could try to save it
> in a
> > database - and not in the index.
> >
> > Lucene has codecs - one of the methods is for 'stored field', so that
> seems
> > likes a natural path for me.
> >
> > However, I'd expect somebody else before had a similar problem. I googled
> > and couldn't find any solutions. Using the codecs seems really good thing
> > for this particular problem, am I missing something? Is there a better
> way
> > to cut down on index size? (besides solr cloud/sharding, compression)
> >
> > Thank you,
> >
> >Roman
> >
>

