Re: Boosting documents by categorical preferences

2014-01-30 Thread Amit Nithian
Chris,

Sounds good! Thanks for the tips.. I'll be glad to submit my talk to this
as I have a writeup pretty much ready to go.

Cheers
Amit


On Tue, Jan 28, 2014 at 11:24 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : The initial results seem to be kinda promising... of course there are many
 : more optimizations I could do like decay user ratings over time to indicate
 : that preferences decay over time so a 5 rating a year ago doesn't count as
 : much as a 5 rating today.
 :
 : Hope this helps others. I'll open source what I have soon and post back. If
 : there is feedback or other thoughts let me know!

 Hey Amit,

 Glad to hear your user based boosting experiments are paying off.  I would
 definitely love to see a more detailed writeup down the road showing off
 how it affects your final user metrics -- or perhaps even give a session
 on your technique at ApacheCon?


 http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp


 -Hoss
 http://www.lucidworks.com/



Re: Boosting documents by categorical preferences

2014-01-27 Thread Amit Nithian
Hi Chris (and others interested in this),

Sorry for dropping off.. I got sidetracked with other work and came back to
this and finally got a V1 of this implemented.

The final process is as follows:
1) Pre-compute the global categorical num_ratings/average/std-dev (so for
Action the average rating may be 3.49 with a stdDev of .99)
2) For a given user, retrieve the last X (X for me is 10) ratings and
compute the user's categorical affinities by taking the average rating for
all movies in that particular category (Action), subtracting the global cat
average, and dividing by the cat std_dev. Furthermore, multiply this by the
fraction of total user ratings in that category.
   - For example, if a user's last 10 ratings consisted of 9/10 Drama and
1/10 Thriller, the z-score of the Thriller should be discounted relative to
that of the Drama so that the user's preference (either positive or
negative) for Drama is more prominent.
3) Sort by the absolute value of the z-score (Thanks Hossman.. great
thought).
4) Return the top 3 (arbitrary number)
5) Modify the query to look like the following:

qq=tom hanks
q={!boost b=$b defType=edismax v=$qq}
cat1=category:Children
cat2=category:Fantasy
cat3=category:Animation
b=sum(1,sum(product(query($cat1),0.22267872),product(query($cat2),0.21630952),product(query($cat3),0.21120241)))

basically b = 1+(pref1*query(category:something1) +
pref2*query(category:something2) + pref3*query(category:something3))
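
For reference, here is a rough sketch (not the actual code; the Rating and
CategoryStats types are hypothetical) of how steps 1-4 above could be computed
in Java:

import java.util.*;
import java.util.stream.*;

class CategoryStats { double mean; double stdDev; }   // from step 1, per category
class Rating { String category; double stars; }       // one of the user's last X ratings

class AffinityCalculator {
    /** Top-N categories by absolute weighted z-score (steps 2-4). */
    Map<String, Double> topAffinities(List<Rating> lastRatings,
                                      Map<String, CategoryStats> globalStats,
                                      int topN) {
        Map<String, List<Rating>> byCat =
            lastRatings.stream().collect(Collectors.groupingBy(r -> r.category));
        Map<String, Double> affinities = new HashMap<>();
        for (Map.Entry<String, List<Rating>> e : byCat.entrySet()) {
            CategoryStats stats = globalStats.get(e.getKey());
            double userAvg = e.getValue().stream().mapToDouble(r -> r.stars).average().orElse(0.0);
            double fraction = (double) e.getValue().size() / lastRatings.size();
            // z-score of the user's average for this category, discounted by how much
            // of their recent history falls in this category
            affinities.put(e.getKey(), ((userAvg - stats.mean) / stats.stdDev) * fraction);
        }
        return affinities.entrySet().stream()
            .sorted((a, b) -> Double.compare(Math.abs(b.getValue()), Math.abs(a.getValue())))
            .limit(topN)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                      (a, b) -> a, LinkedHashMap::new));
    }
}

The resulting weights are what end up as the 0.22267872 etc. multipliers in the
b parameter above.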

The initial results seem to be kinda promising... of course there are many
more optimizations I could do like decay user ratings over time to indicate
that preferences decay over time so a 5 rating a year ago doesn't count as
much as a 5 rating today.

Hope this helps others. I'll open source what I have soon and post back. If
there is feedback or other thoughts let me know!

Cheers
Amit


On Fri, Nov 22, 2013 at 11:38 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : I thought about that but my concern/question was how. If I used the pow
 : function then I'm still boosting the bad categories by a small
 : amount..alternatively I could multiply by a negative number but does that
 : work as expected?

 I'm not sure I understand your concern: negative powers would give you
 values less than 1, positive powers would give you values greater than 1,
 and then you'd use those values as multiplicative boosts -- so the values
 less than 1 would penalize the scores of existing matching docs in the
 categories the user dislikes.

 Oh wait ... I see, in your original email (and in my subsequent suggested
 tweak to use pow()) you were talking about sum()ing up these 3 category
 boosts (and I cut/pasted sum() in my example as well) ... yeah,
 using multiplication there would make more sense if you wanted to do the
 negative preferences as well, because then the score of any matching doc
 will be reduced if it matches on an undesired category -- and the
 amount it will be reduced will be determined by how strongly it
 matches on that category (ie: the base score returned by the nested
 query() func) and how negative the undesired preference value (ie:
 the pow() exponent) is


 qq=...
 q={!boost b=$b v=$qq}

 b=prod(pow(query($cat1),$cat1z),pow(query($cat2),$cat2z),pow(query($cat3),$cat3z))
 cat1=...action...
 cat1z=1.48
 cat2=...comedy...
 cat2z=1.33
 cat3=...kids...
 cat3z=-1.7


 -Hoss



Re: Boosting documents by categorical preferences

2013-11-20 Thread Amit Nithian
I thought about that but my concern/question was how. If I used the pow
function then I'm still boosting the bad categories by a small
amount..alternatively I could multiply by a negative number but does that
work as expected?

I haven't done much with negative boosting except for the sledgehammer
approach of category exclusion through filters.

Thanks
Amit
On Nov 19, 2013 8:51 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

 : My approach was something like:
 : 1) Look at the categories that the user has preferred and compute the
 : z-score
 : 2) Pick the top 3 among those
 : 3) Use those to boost search results.

 I think that totally makes sense ... the additional bit I was suggesting
 that you consider is that instead of picking the highest 3 z-scores,
 pick the z-scores with the greatest absolute value ... that way if someone
 is a very boring person and their positive interests are all basically
 exactly the same as the mean for everyone else, but they have some very
 strong dis-interests, you don't bother boosting on those minuscule
 interests and instead you negatively boost on the things they are
 antagonistic against.


 -Hoss



Re: Boosting documents by categorical preferences

2013-11-18 Thread Amit Nithian
Hey Chris,

Sorry for the delay and thanks for your response. This was inspired by your
talk on boosting and biasing that you presented way back when at a meetup.
I'm glad that my general approach seems to make sense.

My approach was something like:
1) Look at the categories that the user has preferred and compute the
z-score
2) Pick the top 3 among those
3) Use those to boost search results.

I'll look at using the boosts as an exponent instead of a multiplier as I
think that would make sense.. also as it handles the 0 case.
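
To make that concrete (illustrative numbers only): assuming a nested query()
score of 2.0, an exponent of 1.48 gives 2^1.48 ~= 2.8, an exponent of 0 gives
exactly 1 (the totally-average case has no effect), and an exponent of -1.7
gives 2^-1.7 ~= 0.31, which acts as a penalty on a disliked category.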

This is for a prototype I am doing but I'll share the results one day in a
meetup as I think it'll be kinda interesting.

Thanks again
Amit


On Thu, Nov 14, 2013 at 11:11 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : I have a question around boosting. I wanted to use the boost= to write a
 : nested query that will boost a document based on categorical preferences.

 You have no idea how stoked I am to see you working on this in a real
 world application.

 : Currently I have the weights set to the z-score equivalent of a user's
 : preference for that category which is simply how many standard deviations
 : above the global average is this user's preference for that movie category.
 :
 : My question though is basically whether or not semantically the equation
 : query(category:Drama)*some weight + query(category:Comedy)*some weight
 : + query(category:Action)*some weight makes sense?

 My gut says that your approach makes sense -- but if I'm
 understanding you correctly, I think that you need to add 1 to
 all your weights: the boost is a multiplier, so if someone's rating for
 every category is 0 std devs above the average rating (ie: the most
 average person imaginable), you don't want to give every movie in every
 category a score of 0.

 Are you picking the top 3 categories the user prefers as a cut off, or
 are you arbitrarily using N category boosts for however many N categories
 the user is above the global average in their pref for that category?

 Are your preferences coming from explicit user feedback on the categories
 (ie: rate how much you like comedies on a scale of 1-5) or are you
 inferring it from user ratings of the movies themselves? (ie: rate this
 movie, which happens to be a scifi,action,comedy, on a scale of 1-5) ...
 because if it's the latter you probably want to be careful to also
 normalize based on how many categories the movie is in.

 the other thing to consider is whether you want to include negative
 preferences (ie: weights less than 1) based on how many std devs the user's
 average is *below* the global average for a category .. in this case I
 *think* you'd want to divide the raw value from -1 to get a useful
 multiplier.

 Alternatively: you could experiment with using the weights as exponents
 instead of multipliers...


 b=sum(pow(query($cat1),1.482),pow(query($cat2),0.1199),pow(query($cat3),1.448))

 ...that would simplify the math you'd have to worry about both for the
 totally boring average user (x**0 = 1) and for the categories users hate
 (x**-5 = some positive fraction that will act as a penalty) ... but you'd
 definitely need to run some tests to see if it over-boosts as the std
 dev variations get really high (might want to take a root first before
 using them as the exponent)



 -Hoss



Boosting documents by categorical preferences

2013-11-12 Thread Amit Nithian
Hi all,

I have a question around boosting. I wanted to use the boost= to write a
nested query that will boost a document based on categorical preferences.

For a movie search for example, say that a user likes drama, comedy, and
action. I could use things like

qq=&q={!boost%20b=$b%20defType=edismax%20v=$qq}&b=sum(product(query($cat1),1.482),product(query($cat2),0.1199),product(query($cat3),1.448))&cat1=category:Drama&cat2=category:Comedy&cat3=category:Action

where cat1=Drama cat2=Comedy cat3=Action

Currently I have the weights set to the z-score equivalent of a user's
preference for that category which is simply how many standard deviations
above the global average is this user's preference for that movie category.

My question though is basically whether or not semantically the equation
query(category:Drama)*some weight + query(category:Comedy)*some weight
+ query(category:Action)*some weight makes sense?

What are some techniques people use to boost documents based on discrete
things like category, manufacturer, genre etc?

Thanks!
Amit


Re: When is/should qf different from pf?

2013-10-28 Thread Amit Nithian
Thanks Erick. Numeric fields make sense, as I guess would strictly string
fields too since it's one term? In the normal text-searching case, though,
does it make sense to have qf and pf differ?

Thanks
Amit
On Oct 28, 2013 3:36 AM, Erick Erickson erickerick...@gmail.com wrote:

 The facetious answer is when phrases aren't important in the fields.
 If you're doing a simple boolean match, adding phrase fields will add
 expense, to no good purpose etc. Phrases on numeric
 fields seems wrong.

 FWIW,
 Erick


 On Mon, Oct 28, 2013 at 1:03 AM, Amit Nithian anith...@gmail.com wrote:

  Hi all,
 
  I have been using Solr for years but never really stopped to wonder:
 
  When using the dismax/edismax handler, when do you have the qf different
  from the pf?
 
  I have always set them to be the same (maybe different weights) but I was
  wondering if there is a situation where you would have a field in the qf
  not in the pf or vice versa.
 
  My understanding from the docs is that qf is a term-wise hard filter
 while
  pf is a phrase-wise boost of documents who made it past the qf filter.
 
  Thanks!
  Amit
 



Re: How to configure solr to our java project in eclipse

2013-10-27 Thread Amit Nithian
Try this:
http://hokiesuns.blogspot.com/2010/01/setting-up-apache-solr-in-eclipse.html

I use this today and it still works. If anything is outdated (as it's a
relatively old post) let me know.
I wrote this so ping me if you have any questions.

Thanks
Amit


On Sun, Oct 27, 2013 at 7:33 PM, Amit Aggarwal amit.aggarwa...@gmail.com wrote:

 How do you start your other project? If it is Maven or Ant then you can
 use the antrun plugin to start Solr. Otherwise you can write a small shell
 script to start Solr ..
  On 27-Oct-2013 9:15 PM, giridhar girimc...@gmail.com wrote:

  Hi friends, I am Giridhar. Please clarify my doubt.
 
  We are using Solr for our project. The problem is that Solr is outside of
  our project (in another folder).

  We have to manually type java -jar start.jar to start Solr and use its
  services.

  But what we need is, when we run the project, Solr should start
  automatically.

  Our project is a Java project with Tomcat in Eclipse.

  How can I achieve this?

  Please help me.

  Thank you.
  Giridhar
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/How-to-configure-solr-to-our-java-project-in-eclipse-tp4097954.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



When is/should qf different from pf?

2013-10-27 Thread Amit Nithian
Hi all,

I have been using Solr for years but never really stopped to wonder:

When using the dismax/edismax handler, when do you have the qf different
from the pf?

I have always set them to be the same (maybe different weights) but I was
wondering if there is a situation where you would have a field in the qf
not in the pf or vice versa.

My understanding from the docs is that qf is a term-wise hard filter while
pf is a phrase-wise boost of documents that made it past the qf filter.
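
For illustration (field names are hypothetical), a typical edismax setup where
the two differ only in weights might look like:

defType=edismax
qf=title^2 body
pf=title^10 body^3

Here qf decides which fields each individual term must match (and how much each
match counts), while pf adds an extra boost when the whole query appears as a
phrase in one of those fields.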

Thanks!
Amit


Re: Restaurant availability from database

2013-05-23 Thread Amit Nithian
Hossman did a presentation on something similar to this using spatial data
at a Solr meetup some months ago.

http://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/

May be helpful to you.


On Thu, May 23, 2013 at 9:40 AM, rajh ron...@trimm.nl wrote:

 Thank you for your answer.

 Do you mean I should index the availability data as a document in Solr?
 Because the availability data in our databases is around 6,509,972 records
 and contains the availability per number of seats and per 15 minutes. I
 also
 tried this method, and as far as I know it's only possible to join the
 availability documents and not to include that information per result
 document.

 An example API response (created from the Solr response):
 {
   "restaurants": [
     {
       "id": 13906,
       "name": "Allerlei",
       "zipcode": "6511DP",
       "house_number": 59,
       "available": true
     },
     {
       "id": 13907,
       "name": "Voorbeeld",
       "zipcode": "6512DP",
       "house_number": 39,
       "available": false
     }
   ],
   "resultCount": 12156,
   "resultCountAvailable": 55
 }

 I'm currently hacking around the problem by executing the search again with
 a very high value for the rows parameter and counting the number of
 available restaurants on the backend, but this causes a big performance
 impact (as expected).




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Restaurant-availability-from-database-tp4065609p4065710.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: writing a custom Filter plugin?

2013-05-14 Thread Amit Nithian
At first I thought you were referring to Filters in Lucene at query time
(i.e. bitset filters) but I think you are referring to token filters at
indexing/text analysis time?

I have had success writing my own Filter as the link presents. The key is
that you should write a custom class that extends TokenFilter (
http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/analysis/TokenFilter.html)
and write the implementation in your incrementToken() method.

My recollection of this is that instead of returning something of a Token
like you would have in earlier versions of Lucene, you set attribute values
on a notional current token. One obvious attribute is the term text
itself and perhaps any positional information. The best place to start is
to pick a fairly simple example from the Solr Source (maybe
lowercasefilter) and try and mimic that.
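
To make that concrete, here is a minimal hypothetical sketch (not from any Solr
release) of a filter that lowercases the current token via its term attribute:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class MyLowerCaseFilter extends TokenFilter {
    // the "attribute" view of the current token's term text
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public MyLowerCaseFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;                 // no more tokens in the stream
        }
        // mutate the current token in place instead of returning a Token object
        char[] buffer = termAtt.buffer();
        for (int i = 0; i < termAtt.length(); i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
        return true;
    }
}

To wire it into schema.xml you would also need a small TokenFilterFactory
subclass whose create(TokenStream) method returns this filter.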

Cheers!
Amit


On Mon, May 13, 2013 at 1:33 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Does anyone know of any tutorials, basic examples, and/or documentation on
 writing your own Filter plugin for Solr? For Solr 4.x/4.3?

 I would like a Solr 4.3 version of the normalization filters found here
 for Solr 1.4: 
  https://github.com/billdueber/lib.umich.edu-solr-stuff

 But those are old, for Solr 1.4.

 Does anyone have any hints for writing a simple substitution Filter for
 Solr 4.x?  Or, does a simple sourcecode example exist anywhere?



Re: Need solr query help

2013-05-14 Thread Amit Nithian
Is it possible instead to store in your Solr index a bounding box of store
location + delivery radius, and do a bounding-box intersection between your
user's point + radius (as a bounding box) and the shop's delivery bounding
box? If you want further precision, frange may work, assuming it's a
post-filter implementation, so that you only do the heavy computation on a
presumably small set of data to filter out the corner cases around the edge
of the radius circle.

I haven't looked at Solr's spatial querying in a while to know if this is
possible or not.
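
As a rough sketch of that idea (field names, point, and distance are made up,
and the exact syntax should be checked against the spatial docs):

fq={!bbox pt=45.15,-93.85 sfield=store_location d=50}
fq={!frange u=0 cache=false cost=200}sub(geodist(store_location,45.15,-93.85),delivery_radius)

The bbox filter does the cheap coarse cut, and the frange (with cache=false and
a cost of 100 or more so it runs as a post-filter) only keeps shops whose
delivery_radius actually covers the user's point.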

Cheers
Amit


On Sat, May 11, 2013 at 10:42 AM, smsolr sms...@hotmail.com wrote:

 Hi Abhishek,

 I forgot to explain why it works.  It uses the frange filter which is
 mentioned here:-

 http://wiki.apache.org/solr/CommonQueryParameters

 and it works because it filters in results where the geodist minus the
 shopMaxDeliveryDistance is less than zero (that's what the u=0 means, upper
 limit=0), i.e.:-

 geodist - shopMaxDeliveryDistance < 0
 ->
 geodist < shopMaxDeliveryDistance

 i.e. the geodist is less than the shopMaxDeliveryDistance and so the shop
 is
 within delivery range of the location specified.

 smsolr



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Need-solr-query-help-tp4061800p4062603.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Sharing index amongst multiple nodes

2013-04-06 Thread Amit Nithian
I don't understand why this would be more performant.. seems like it'd be
more memory and resource intensive as you'd have multiple class-loaders and
multiple cache spaces for no good reason. Just have a single core with
sufficiently large caches to handle your response needs.

If you want to load balance reads consider having multiple physical nodes
with a master/slaves or SolrCloud.


On Sat, Apr 6, 2013 at 9:21 AM, Daire Mac Mathúna daire...@gmail.com wrote:

 Hi. What are the thoughts on having multiple SOLR instances, i.e. multiple
 SOLR war files, sharing the same index (i.e. sharing the same solr_home)
 where only one SOLR instance is used for writing and the others for
 reading?

 Is this possible?

 Is it beneficial - is it more performant than having just one solr
 instance?

 How does it affect auto-commits i.e. how would the read nodes know the
 index has been changed and re-populate cache etc.?

 Solr 3.6.1

 Thanks.



Re: how to skip test while building

2013-04-06 Thread Amit Nithian
If you generate the Maven pom files you can do this, I think, by running your
mvn command with -DskipTests=true.
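
For example (the package goal here is just illustrative):

mvn package -DskipTests=true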


On Sat, Apr 6, 2013 at 7:25 AM, Erick Erickson erickerick...@gmail.com wrote:

 Don't know a good way to skip compiling the tests, but there isn't
 any harm in compiling them...

 changing to the solr directory and just issuing
 ant example dist builds pretty much everything. You don't execute
 tests unless you specify ant test.

 ant -p shows you all the targets. Note that you have different
 targets depending on whether you're executing it in solr_home or
 solr_home/solr or solr_home/lucene.

 Since you mention Solr, you probably want to work in solr_home/solr to
 start.

 Best
 Erick

 On Sat, Apr 6, 2013 at 5:36 AM, parnab kumar parnab.2...@gmail.com
 wrote:
  Hi All,
 
   I am new to Solr. I am using Solr 3.4. I want to build without
   building the Lucene test files and to skip firing the tests. Can
   anyone please help with where to make the necessary changes?
 
  Thanks,
  Pom



Re: Solr 4.2 single server limitations

2013-04-05 Thread Amit Nithian
There's a whole heap of information that is missing like what you plan on
storing vs indexing and yes QPS too. My short answer is try with one server
until it falls over then start adding more.

When you say multiple-server setup do you mean multiple servers where each
server acts as a slave storing the entire index so you have load balancing
across multiple servers OR do you mean multiple servers where each server
stores a portion of the data? If it's the former, sometimes a simple
master/slave setup in Solr 4.x works but the latter may mean SolrCloud.
Master/Slave is easy but I don't know much about SolrCloud.

Questions to think about (this is not exhaustive by any means)
1) When you say 5-10 pages per website (300+ websites) that you are
crawling 2x per hour, are you *replacing* the old copy of the web page in
your index or storing some form of history for some reason?
2) What are you planning on storing vs indexing? That would dictate your
memory requirements.
3) You mentioned you don't know QPS, but having some guess would help.. is
it mostly for storage and occasional lookup (where slow responses are
probably tolerable) or is this powering a real user-facing website (where
low latency is probably desired)?

Again, I like to start simple and use one server until it dies then expand
from there.

Cheers
Amit


On Thu, Apr 4, 2013 at 7:58 AM, imehesz imeh...@gmail.com wrote:

 hello,

 I'm using a single server setup with Nutch (1.6) and Solr (4.2)

 I plan to trigger the Nutch crawling process every 30 minutes or so and add
 about 300+ websites a month with (~5-10 pages each). At this point I'm not
 sure about the query requests/sec.

 Can I run this on a single server (how long)?
 If not, what would be the best and most efficient way to have multiple
 server setup?

 thanks,
 --iM



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-2-single-server-limitations-tp4053829.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: do SearchComponents have access to response contents

2013-04-04 Thread Amit Nithian
We need to also track the size of the response (as the size in bytes of the
whole xml response that is streamed, with stored fields and all). I was a
bit worried cause I am wondering if a searchcomponent will actually have
access to the response bytes...

== Can't you get this from your container access logs after the fact? I
may be misunderstanding something but why wouldn't mining the Jetty/Tomcat
logs for the response size here suffice?
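
For what it's worth, a sketch of the kind of container-level logging I mean (a
Tomcat AccessLogValve in server.xml; the pattern is illustrative -- %b is the
bytes sent and %D the time taken):

<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="solr_access." suffix=".log"
       pattern="%h %t &quot;%r&quot; %s %b %D" />

Jetty has an equivalent NCSA request-log mechanism, so the response size can be
mined there as well without touching Solr code.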

Thanks!
Amit


On Thu, Apr 4, 2013 at 1:34 AM, xavier jmlucjav jmluc...@gmail.com wrote:

 A custom QueryResponseWriter...this makes sense, thanks Jack


 On Wed, Apr 3, 2013 at 11:21 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  The search components can see the response as a namedlist, but it is
  only when SolrDispatchFilter calls the QueryResponseWriter that XML or
 JSON
  or whatever other format (Javabin as well) is generated from the named
 list
  for final output in an HTTP response.
 
  You probably want a custom query response writer that wraps the XML
  response writer. Then you can generate the XML and then do whatever you
  want with it.
 
   The QueryResponseWriter class and the <queryResponseWriter> element in
  solrconfig.xml.
 
  -- Jack Krupansky
 
  -Original Message- From: xavier jmlucjav
  Sent: Wednesday, April 03, 2013 4:22 PM
  To: solr-user@lucene.apache.org
  Subject: do SearchComponents have access to response contents
 
 
  I need to implement some SearchComponent that will deal with metrics on
 the
  response. Some things I see will be easy to get, like number of hits for
  instance, but I am more worried with this:
 
  We need to also track the size of the response (as the size in bytes of
 the
   whole xml response that is streamed, with stored fields and all). I was a
  bit worried cause I am wondering if a searchcomponent will actually have
  access to the response bytes...
 
  Can someone confirm one way or the other? We are targeting Sorl4.0
 
  thanks
  xavier
 



Re: SOLR on hdfs

2013-03-06 Thread Amit Nithian
Why wouldn't SolrCloud help you here? You can set up shards and replicas etc.
to have redundancy, b/c HDFS isn't designed to serve real-time queries as
far as I understand. If you are using HDFS as a backup mechanism, to me
you'd be better served having multiple slaves tethered to a master (in a
non-cloud environment) or setting up SolrCloud; either option would give you
more redundancy than copying an index to HDFS.


On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim ysli...@gmail.com wrote:

 Hi Upayavira,

 sure, let me explain. I am setting up Nutch and SOLR in a Hadoop environment.
 Since I am using HDFS, in the event there are any crashes on the
 localhost (running Solr), I will still have the shards of data being stored
 in HDFS.

 Thank you so much =)

 On Thu, Mar 7, 2013 at 1:19 AM, Upayavira u...@odoko.co.uk wrote:

  What are you actually trying to achieve? If you can share what you are
  trying to achieve maybe folks can help you find the right way to do it.
 
  Upayavira
 
  On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
   Hello Otis ,
  
   Is there any configuration where it will index into hdfs instead?
  
   I tried crawlzilla and  lily but I hope to update specific package such
   as
   Hadoop only or nutch only when there are updates.
  
   That's y would prefer to install separately .
  
   Thanks so much. Looking forward for your reply.
  
   On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
  
Hello Joseph,
   
You can certainly put them there, as in:
  hadoop fs -copyFromLocal localsrc URI
   
But searching such an index will be slow.
See also: http://katta.sourceforge.net/
   
Otis
--
Solr  ElasticSearch Support
http://sematext.com/
   
   
   
   
   
On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim ysli...@gmail.com
  javascript:;
wrote:
   
 Hi,
 Would like to know how can i put the indexed solr shards into hdfs?

 Thanks..

 Joseph
 On Mar 6, 2013 7:28 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.comjavascript:;

 wrote:

  Hi Joseph,
 
  What exactly are you looking to to?
  See http://incubator.apache.org/blur/
 
  Otis
  --
  Solr  ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim ysli...@gmail.com
  javascript:;
wrote:
 
   Hi I am running hadoop distributed file system, how do I put my
output
 of
   the solr dir into hdfs automatically?
  
   Thanks so much..
  
   --
   Best Regards,
   *Joseph*
  
 

   
  
  
   --
   Best Regards,
   *Joseph*
 



 --
 Best Regards,
 *Joseph*



Re: SOLR on hdfs

2013-03-06 Thread Amit Nithian
Joseph,

Doing what Otis said will do literally what you want which is copying the
index to HDFS. It's no different than copying it to a different machine
which btw is what Solr's master/slave replication scheme does.
Alternatively, I think people are starting to setup new Solr instances with
SolrCloud which doesn't have the concept of master/slave but rather a
series of nodes with the option of having replicas (what I believe to be
backup nodes) so that you have the redundancy you want.

Honestly, HDFS in the way that you are looking for is probably no different
than storing your Solr index in a RAIDed storage format, but I don't
pretend to know much about RAID arrays.

What exactly are you trying to achieve from a systems perspective? Why do
you want Hadoop in the mix here and how does copying the index to HDFS help
you? If SolrCloud seems complicated, try just setting up a simple
master/slave replication scheme, as that's really easy.
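
For reference, a minimal sketch of that kind of master/slave setup in
solrconfig.xml (host, core name and poll interval are made up):

<!-- on the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- on each slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core1</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>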

Cheers
Amit


On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim ysli...@gmail.com wrote:

 Hi Amit,

 so you mean that if I just want to get redundancy for Solr in HDFS, the
 best way to do it is, as per what Otis suggested, to use the following
 command:

 hadoop fs -copyFromLocal <localsrc> <URI>

 Ok let me try out solrcloud as I will need to make sure it works well with
 nutch too..

 Thanks for the help..


 On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian anith...@gmail.com wrote:

  Why wouldn't SolrCloud help you here? You can setup shards and replicas
 etc
  to have redundancy b/c HDFS isn't designed to serve real time queries as
  far as I understand. If you are using HDFS as a backup mechanism to me
  you'd be better served having multiple slaves tethered to a master (in a
  non-cloud environment) or setup SolrCloud either option would give you
 more
  redundancy than copying an index to HDFS.
 
  - Amit
 
 
  On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim ysli...@gmail.com wrote:
 
   Hi Upayavira,
  
   sure, let me explain. I am setting up Nutch and SOLR in hadoop
  environment.
   Since I am using hdfs, in the event if there is any crashes to the
   localhost(running solr), i will still have the shards of data being
  stored
   in hdfs.
  
   Thanks you so much =)
  
   On Thu, Mar 7, 2013 at 1:19 AM, Upayavira u...@odoko.co.uk wrote:
  
What are you actually trying to achieve? If you can share what you
 are
trying to achieve maybe folks can help you find the right way to do
 it.
   
Upayavira
   
On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
 Hello Otis ,

 Is there any configuration where it will index into hdfs instead?

 I tried crawlzilla and  lily but I hope to update specific package
  such
 as
 Hadoop only or nutch only when there are updates.

 That's y would prefer to install separately .

 Thanks so much. Looking forward for your reply.

 On Wednesday, March 6, 2013, Otis Gospodnetic wrote:

  Hello Joseph,
 
  You can certainly put them there, as in:
hadoop fs -copyFromLocal localsrc URI
 
  But searching such an index will be slow.
  See also: http://katta.sourceforge.net/
 
  Otis
  --
  Solr  ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim ysli...@gmail.com
javascript:;
  wrote:
 
   Hi,
   Would like to know how can i put the indexed solr shards into
  hdfs?
  
   Thanks..
  
   Joseph
   On Mar 6, 2013 7:28 PM, Otis Gospodnetic 
otis.gospodne...@gmail.comjavascript:;
  
   wrote:
  
Hi Joseph,
   
What exactly are you looking to to?
See http://incubator.apache.org/blur/
   
Otis
--
Solr  ElasticSearch Support
http://sematext.com/
   
   
   
   
   
On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim 
 ysli...@gmail.com
javascript:;
  wrote:
   
 Hi I am running hadoop distributed file system, how do I
 put
  my
  output
   of
 the solr dir into hdfs automatically?

 Thanks so much..

 --
 Best Regards,
 *Joseph*

   
  
 


 --
 Best Regards,
 *Joseph*
   
  
  
  
   --
   Best Regards,
   *Joseph*
  
 



 --
 Best Regards,
 *Joseph*



Re: ping query frequency

2013-03-03 Thread Amit Nithian
We too run a ping every 5 seconds and I think the concurrent Mark/Sweep
helps to avoid the LB from taking a box out of rotation due to long pauses.
Either that or I don't see large enough pauses for my LB to take it out
(it'd have to fail 3 times in a row or 15 seconds total before it's gone).

The ping query does execute an actual query, so of course you want to make
it as simple as possible (i.e. q=primary_key:value) so that there's
limited to no scanning of the index. I think our query does an id:0, which
would always return 0 docs, but any stupid-simple query is fine so long
as it hits the caches on subsequent hits. The goal, to me at least, is not
that the ping query yields actual docs but that it's a mechanism to remove
a Solr server from rotation without having to log in to an ops-controlled
device directly.

I'd definitely remove the ping per request (wouldn't the fact that you are
doing /select serve as the ping and hence defeat the purpose of the ping
query?) and definitely do the frequent ping as we are describing if you want
to have your Solr boxes behind some load balancer.
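
For reference, the kind of ping handler this assumes in solrconfig.xml looks
roughly like the stock example (the query and file name here are illustrative):

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">id:0</str>  <!-- stupid-simple query; hits the caches after the first request -->
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
  </lst>
  <!-- lets ops pull a node out of rotation by removing this file, without touching the LB -->
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>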


On Sun, Mar 3, 2013 at 8:21 AM, Shawn Heisey s...@elyograg.org wrote:

 On 3/3/2013 2:15 AM, adm1n wrote:

  I'm wondering how frequently this query should be made. Currently it is done
  before each select request (some very old legacy). I googled a little and
  found out that it is bad practice and has a performance impact. So the
  question is should I completely remove it or just do it once in some period
  of time.


 Can you point me at the place where it says that it's bad practice to do
 frequent pings?  I use the ping functionality in my haproxy load balancer
 that sits in front of Solr.  It executes a ping request against all my Solr
 instances every five seconds.  Most of the time, the ping request (which is
 distributed) finishes in single-digit milliseconds. If that is considered
 bad practice, I want to figure out why and submit issues to get the problem
 fixed.

 I can imagine that sending a ping before every query would be a bad idea,
 but I am hoping that the way I'm using it is OK.

 The only problem with ping requests that I have ever noticed was caused by
 long garbage collection pauses on my 8GB Solr heap.  Those pauses caused
 the load balancer to incorrectly mark the active Solr instance(s) as down
 and send requests to a backup.

 Through experimentation with -XX memory tuning options, I have now
 eliminated the GC pause problem.  For machines running Solr 4.2-SNAPSHOT, I
 have reduced the heap to 6GB, the 3.5.0 machines are still running with 8GB.

 Thanks,
 Shawn




Re: Poll: SolrCloud vs. Master-Slave usage

2013-03-01 Thread Amit Nithian
But does that mean that in SolrCloud, slave nodes are busy indexing
documents?


On Fri, Mar 1, 2013 at 5:37 AM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Amit,

 NRT is not possible in a master-slave setup because of the necessity
 of a hard commit and replication, both of which add considerable
 delay.

 Solr Cloud sends each document for a given shard to each node hosting
 that shard, so there's no need for the hard commit and replication for
 visibility.

 You could conceivably get NRT on a single node without Solr Cloud, but
 there would be no redundancy.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Fri, Mar 1, 2013 at 1:22 AM, Amit Nithian anith...@gmail.com wrote:
  Erick,
 
  Well put and thanks for the clarification. One question:
  And if you need NRT, you just can't get it with traditional M/S setups.
  == Can you explain how that works with SolrCloud?
 
  I agree with what you said too because there was an article or
 discussion I
  read that said having high-availability masters requires some fairly
  complicated setups and I guess I am under-estimating how
  expensive/complicated our setup is relative to what you can get out of
 the
  box with SolrCloud.
 
  Thanks!
  Amit
 
 
  On Thu, Feb 28, 2013 at 6:29 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Amit:
 
  It's a balancing act. If I was starting fresh, even with one shard, I'd
  probably use SolrCloud rather than deal with the issues around the how
 do
  I recover if my master goes down question. Additionally, SolrCloud
 allows
  one to monitor the health of the entire system by monitoring the state
  information kept in Zookeeper rather than build a monitoring system that
  understands the changing topology of your network.
 
  And if you need NRT, you just can't get it with traditional M/S setups.
 
  In a mature production system where all the operational issues are
 figured
  out and you don't need NRT, it's easier just to plop 4.x in traditional
 M/S
  setups and not go to SolrCloud. And you're right, you have to understand
  Zookeeper which isn't all that difficult, but is another moving part and
  I'm a big fan of keeping the number of moving parts down if possible.
 
  It's not a one-size-fits-all situation. From what you've described, I
 can't
  say there's a compelling reason to do the SolrCloud thing. If you find
  yourself spending lots of time building monitoring or High
  Availability/Disaster Recovery tools, then you might find the
 cost/benefit
  analysis changing.
 
  Personally, I think it's ironic that the memory improvements that came
  along _with_ SolrCloud make it less necessary to shard. Which means that
  traditional M/S setups will suit more people longer G
 
  Best
  Erick
 
 
  On Thu, Feb 28, 2013 at 8:22 PM, Amit Nithian anith...@gmail.com
 wrote:
 
   I don't know a ton about SolrCloud but for our setup and my limited
   understanding of it is that you start to bleed operational and
   non-operational aspects together which I am not comfortable doing
 (i.e.
   software load balancing). Also adding ZooKeeper to the mix is yet
 another
   thing to install, setup, monitor, maintain etc which doesn't add any
  value
   above and beyond what we have setup already.
  
   For example, we have a hardware load balancer that can do the actual
 load
   balancing of requests among the slaves and taking slaves in and out of
   rotation either on demand or if it's down. We've placed a virtual IP
 on
  top
   of our multiple masters so that we have redundancy there. While we
 have
   multiple cores, the data volume is large enough to fit on one node so
 we
   aren't at the data volume necessary for sharding our indices. I
 suspect
   that if we had a sufficiently large dataset that couldn't fit on one
 box
   SolrCloud is perfect but when you can fit on one box, why add more
   complexity?
  
   Please correct me if I'm wrong for I'd like to better understand this!
  
  
  
  
   On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote:
  
I am doing research on SolrCloud.
   
   
   
--
View this message in context:
   
  
 
 http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html
Sent from the Solr - User mailing list archive at Nabble.com.
   
  
 



Re: Poll: SolrCloud vs. Master-Slave usage

2013-02-28 Thread Amit Nithian
I don't know a ton about SolrCloud but for our setup and my limited
understanding of it is that you start to bleed operational and
non-operational aspects together which I am not comfortable doing (i.e.
software load balancing). Also adding ZooKeeper to the mix is yet another
thing to install, setup, monitor, maintain etc which doesn't add any value
above and beyond what we have setup already.

For example, we have a hardware load balancer that can do the actual load
balancing of requests among the slaves and taking slaves in and out of
rotation either on demand or if it's down. We've placed a virtual IP on top
of our multiple masters so that we have redundancy there. While we have
multiple cores, the data volume is small enough to fit on one node, so we
aren't at the data volume necessary for sharding our indices. I suspect
that if we had a sufficiently large dataset that couldn't fit on one box
SolrCloud is perfect but when you can fit on one box, why add more
complexity?

Please correct me if I'm wrong for I'd like to better understand this!




On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote:

 I am doing research on SolrCloud.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Poll: SolrCloud vs. Master-Slave usage

2013-02-28 Thread Amit Nithian
Erick,

Well put and thanks for the clarification. One question:
And if you need NRT, you just can't get it with traditional M/S setups.
== Can you explain how that works with SolrCloud?

I agree with what you said too because there was an article or discussion I
read that said having high-availability masters requires some fairly
complicated setups and I guess I am under-estimating how
expensive/complicated our setup is relative to what you can get out of the
box with SolrCloud.

Thanks!
Amit


On Thu, Feb 28, 2013 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote:

 Amit:

 It's a balancing act. If I was starting fresh, even with one shard, I'd
 probably use SolrCloud rather than deal with the issues around the how do
 I recover if my master goes down question. Additionally, SolrCloud allows
 one to monitor the health of the entire system by monitoring the state
 information kept in Zookeeper rather than build a monitoring system that
 understands the changing topology of your network.

 And if you need NRT, you just can't get it with traditional M/S setups.

 In a mature production system where all the operational issues are figured
 out and you don't need NRT, it's easier just to plop 4.x in traditional M/S
 setups and not go to SolrCloud. And you're right, you have to understand
 Zookeeper which isn't all that difficult, but is another moving part and
 I'm a big fan of keeping the number of moving parts down if possible.

 It's not a one-size-fits-all situation. From what you've described, I can't
 say there's a compelling reason to do the SolrCloud thing. If you find
 yourself spending lots of time building monitoring or High
 Availability/Disaster Recovery tools, then you might find the cost/benefit
 analysis changing.

 Personally, I think it's ironic that the memory improvements that came
 along _with_ SolrCloud make it less necessary to shard. Which means that
 traditional M/S setups will suit more people longer <g>

 Best
 Erick


 On Thu, Feb 28, 2013 at 8:22 PM, Amit Nithian anith...@gmail.com wrote:

  I don't know a ton about SolrCloud but for our setup and my limited
  understanding of it is that you start to bleed operational and
  non-operational aspects together which I am not comfortable doing (i.e.
  software load balancing). Also adding ZooKeeper to the mix is yet another
  thing to install, setup, monitor, maintain etc which doesn't add any
 value
  above and beyond what we have setup already.
 
  For example, we have a hardware load balancer that can do the actual load
  balancing of requests among the slaves and taking slaves in and out of
  rotation either on demand or if it's down. We've placed a virtual IP on
 top
  of our multiple masters so that we have redundancy there. While we have
  multiple cores, the data volume is large enough to fit on one node so we
  aren't at the data volume necessary for sharding our indices. I suspect
  that if we had a sufficiently large dataset that couldn't fit on one box
  SolrCloud is perfect but when you can fit on one box, why add more
  complexity?
 
  Please correct me if I'm wrong for I'd like to better understand this!
 
 
 
 
  On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote:
 
   I am doing research on SolrCloud.
  
  
  
   --
   View this message in context:
  
 
 http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html
   Sent from the Solr - User mailing list archive at Nabble.com.
  
 



Re: numFound is not correct while using Result Grouping

2013-02-26 Thread Amit Nithian
I need to write some tests which I hope to do tonight and then I think
it'll get into 4.2


On Tue, Feb 26, 2013 at 6:24 AM, Nicholas Ding nicholas...@gmail.com wrote:

 Thanks Amit, that's cool! So it will also be fixed on Solr 4.2, right?

 On Mon, Feb 25, 2013 at 6:04 PM, Amit Nithian anith...@gmail.com wrote:

  Yeah I had a similar problem. I filed and submitted this patch:
  https://issues.apache.org/jira/browse/SOLR-4310
 
  Let me know if this is what you are looking for!
  Amit
 
 
  On Mon, Feb 25, 2013 at 1:50 PM, Teun Duynstee t...@duynstee.com
 wrote:
 
   Ah, I see. The docs say Although this result format does not have as
  much
   information, it may be easier for existing solr clients to parse. I
  guess
   the ngroups value could be added to this format, but apparently it
  isn't. I
   do agree with you that to be usefull (as in possible to read for a
 client
   that doesn't know of the grouped format), the number should be that of
  the
   groups, not of the documents.
  
   A quick glance in the code learns that it is indeed not calculated in
  this
   case. But not completely trivial to fix. Could you use format=simple
   instead? That will work with ngroups.
  
   Teun
  
  
   2013/2/25 Nicholas Ding nicholas...@gmail.com
  
Thanks Teun and Carlos, I set group.ngroups=true, but I don't have
 this
ngroup number when I was using group.main = true.
   
On Mon, Feb 25, 2013 at 12:02 PM, Carlos Maroto 
cmar...@searchtechnologies.com wrote:
   
 Use group.ngroups, check it in the Solr wiki for FieldCollapsing

 Carlos Maroto
 Search Architect at Search Technologies (
 www.searchtechnologies.com)



 Nicholas Ding nicholas...@gmail.com wrote:


 Hello,

 I grouped the result, and set group.main=true. I was expecting the
numFound
 equals to the number of groups, but actually it was not.

 How do I get the number of groups?

 Thanks
 Nicholas

   
  
 



Re: numFound is not correct while using Result Grouping

2013-02-25 Thread Amit Nithian
Yeah I had a similar problem. I filed and submitted this patch:
https://issues.apache.org/jira/browse/SOLR-4310
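
For context, the kind of request being discussed (the field name is
hypothetical) looks like:

q=*:*&group=true&group.field=category&group.ngroups=true&group.format=simple

With the simple (or default grouped) format plus group.ngroups=true you get an
ngroups count back, whereas with group.main=true the grouped hits are flattened
into the main result list and numFound reflects the matching documents rather
than the number of groups.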

Let me know if this is what you are looking for!
Amit


On Mon, Feb 25, 2013 at 1:50 PM, Teun Duynstee t...@duynstee.com wrote:

 Ah, I see. The docs say "Although this result format does not have as much
 information, it may be easier for existing solr clients to parse." I guess
 the ngroups value could be added to this format, but apparently it isn't. I
 do agree with you that to be useful (as in possible to read for a client
 that doesn't know of the grouped format), the number should be that of the
 groups, not of the documents.

 A quick glance at the code shows that it is indeed not calculated in this
 case. But it is not completely trivial to fix. Could you use format=simple
 instead? That will work with ngroups.

 Teun


 2013/2/25 Nicholas Ding nicholas...@gmail.com

  Thanks Teun and Carlos, I set group.ngroups=true, but I don't have this
  ngroup number when I was using group.main = true.
 
  On Mon, Feb 25, 2013 at 12:02 PM, Carlos Maroto 
  cmar...@searchtechnologies.com wrote:
 
   Use group.ngroups, check it in the Solr wiki for FieldCollapsing
  
   Carlos Maroto
   Search Architect at Search Technologies (www.searchtechnologies.com)
  
  
  
   Nicholas Ding nicholas...@gmail.com wrote:
  
  
   Hello,
  
   I grouped the result, and set group.main=true. I was expecting the
  numFound
   equals to the number of groups, but actually it was not.
  
   How do I get the number of groups?
  
   Thanks
   Nicholas
  
 



Re: [ANN] vifun: tool to help visually tweak Solr boosting

2013-02-25 Thread Amit Nithian
This is cool! I had done something similar except changing via JConsole/JMX:
https://issues.apache.org/jira/browse/SOLR-2306

We had something not as nice at Zvents but I wanted to expose these as
MBean properties so you could change them via any JMX UI like JVisualVM

Cheers!
Amit


On Mon, Feb 25, 2013 at 2:36 PM, jmlucjav jmluc...@gmail.com wrote:

 Apologies...instructions are wrong on the cd, these commands are to be run
 at the top level of the project...I fixed the doc to read:

 cd vifun
 griffon run-app



 On Mon, Feb 25, 2013 at 10:45 PM, Jan Høydahl jan@cominvent.com
 wrote:

  Hi,
 
  I actually tried ../griffonw run-app but it says griffon-app does not
  appear to be part of a Griffon application.
 
  I installed griffon and tried again griffon run-app inside of
  griffon-app, but same error.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
  Solr Training - www.solrtraining.com
 
  25. feb. 2013 kl. 19:51 skrev jmlucjav jmluc...@gmail.com:
 
   Jan, thanks for looking at this!
  
   - Running from source: would you care to send me the error you get (if
  any)
   when running from source? I assume you have griffon1.1.0 installed
 right?
  
   - Binary dist: the distrib is created by griffon, so I'll check if the
   permission issue (I develop on windows, and tested on a clean windows
  too,
   so I don't face the issue you mention) is known or can be fixed
 somehow.
   I'll update the doc anyway.
  
   - wt param: I am already overriding wt param (in order to use javabin).
   What I didn't allow is to choose the handler to be used when submitting
  the
   query. I guess any handler that does not have appends/invariants
 that
   would interfere would work fine, I just thought /select is mostly
  available
   in most installations and that is one thing less to configure. But
 yes, I
   could let the user configure it, I'll open an issue.
  
   xavier
  
   On Mon, Feb 25, 2013 at 3:10 PM, Jan Høydahl jan@cominvent.com
  wrote:
  
   Cool. I tried running from source (using the bundled griffonw), but I
   think the instructions may be wrong, had to download binary dist.
   The file permissions for bin/vifun in binary dist should have +w so
 you
   can execute it with ./vifun
  
   What about the ability to override the wt param, so that you can
 point
   it to the /browse handler directly?
  
   --
   Jan Høydahl, search solution architect
   Cominvent AS - www.cominvent.com
   Solr Training - www.solrtraining.com
  
   23. feb. 2013 kl. 15:12 skrev jmlucjav jmluc...@gmail.com:
  
   Hi,
  
   I have built a small tool to help me tweak some params in Solr
  (typically
   qf, bf in edismax). As maybe others find it useful, I am open
 sourcing
  it
   on github: https://github.com/jmlucjav/vifun
  
   Check github for some more info and screenshots. I include part of
 the
   github page below.
   regards
  
   Description
  
   Did you ever spend lots of time trying to tweak all numbers in a
   *edismax*
   handler *qf*, *bf*, etc params so docs get scored to your liking?
  Imagine
   you have the params below, is 20 the right boosting for *name* or is
 it
   too
   much? Is *population* being boosted too much versus distance? What
  about
   new documents?
  
  <!-- fields, boost some -->
  <str name="qf">name^20 textsuggest^10 edge^5 ngram^2 phonetic^1</str>
  <str name="mm">33%</str>
  <!-- boost closest hits -->
  <str name="bf">recip(geodist(),1,500,0)</str>
  <!-- boost by population -->
  <str name="bf">product(log(sum(population,1)),100)</str>
  <!-- boost newest docs -->
  <str name="bf">recip(rord(moddate),1,1000,1000)</str>
  
   This tool was developed in order to help me tweak the values of
  boosting
   functions etc in Solr, typically when using edismax handler. If you
 are
   fed
   up of: change a number a bit, restart Solr, run the same query to see
  how
   documents are scored now...then this tool is for you.
   https://github.com/jmlucjav/vifun#featuresFeatures
  
- Can tweak numeric values in the following params: *qf, pf, bf, bq,
boost, mm* (others can be easily added) even in *appends or
invariants*
- View side by side a Baseline query result and how it changes when
  you
gradually change each value in the params
- Colorized values, color depends on how the document does related
 to
baseline query
- Tooltips give you Explain info
- Works on remote Solr installations
- Tested with Solr 3.6, 4.0 and 4.1 (other versions would work too,
 as
long as wt=javabin format is compatible)
- Developed using Groovy/Griffon
  
   https://github.com/jmlucjav/vifun#requirementsRequirements
  
- */select* handler should be available, and not have any *appends
  or
invariants*, as it could interfere with how vifun works.
- Java6 is needed (maybe it runs on Java5 too). A JRE should be
  enough.
  
   https://github.com/jmlucjav/vifun#getting-startedGetting started
 

Re: Slaves always replicate entire index Index versions

2013-02-21 Thread Amit Nithian
A few others have posted about this too apparently, and SOLR-4413 is the
root problem. Basically what I am seeing is that if your index directory is
not index/ but rather the index.<timestamp> set in index.properties, a new
index will be downloaded all the time because the download expects
your index to be in <solr_data_dir>/index. Sounds like a quick solution
might be to rename your index directory to just index and see if the
problem goes away.
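
For context, the index.properties file in the core's data directory is what
points Solr at the timestamped directory; its contents look roughly like this
(the timestamp is made up):

index=index.20130221094500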

To confirm, look at line 728 in the SnapPuller.java file (in
downloadIndexFiles)

I am hoping that the patch and a more unified getIndexDir can be added to
the next release of Solr as this is a fairly significant bug to me.

Cheers
Amit

On Thu, Feb 21, 2013 at 12:56 AM, Amit Nithian anith...@gmail.com wrote:

 So the diff in generation numbers are due to the commits I believe that
 Solr does when it has the new index files but the fact that it's
 downloading a new index each time is baffling and I just noticed that too
 (hit the replicate button and noticed a full index download). I'm going to
 pop in to the source and see what's going on to see why unless there's a
 known bug filed about this?


 On Tue, Feb 19, 2013 at 1:48 AM, Raúl Grande Durán 
 raulgrand...@hotmail.com wrote:


 Hello.
 We have recently updated our Solr from 3.5 to 4.1 and everything is
 running perfectly except the replication between nodes. We have a
 master-repeater-2 slaves architecture and we have seen some things that
 weren't happening before:
 - When a slave (repeater or slaves) starts to replicate it needs to
   download the entire index, even when only some small changes have been
   made to the index at the master. This takes a long time since our index
   is more than 20 GB.
 - After a replication cycle we have different index generations in
   master, repeater and slaves. For example:
     Master: gen. 64590
     Repeater: gen. 64591
     Both slaves: gen. 64592
 My replicationHandler configuration is like this:

 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="master">
     <str name="enable">${enable.master:false}</str>
     <str name="replicateAfter">commit</str>
     <str name="replicateAfter">startup</str>
     <str name="confFiles">schema.xml,stopwords.txt</str>
   </lst>
   <lst name="slave">
     <str name="enable">${enable.slave:false}</str>
     <str name="masterUrl">${solr.master.url:http://localhost/solr}</str>
     <str name="pollInterval">00:03:00</str>
   </lst>
 </requestHandler>

 Our problems are very similar to those explained here:
 http://lucene.472066.n3.nabble.com/Problem-with-replication-td2294313.html
 Any ideas?? Thanks




Re: Slaves always replicate entire index Index versions

2013-02-21 Thread Amit Nithian
Thanks for the links... I have updated SOLR-4471 with a proposed solution
that I hope can be incorporated or amended so we can get a clean fix into
the next version so our operations and network staff will be happier with
not having gigs of data flying around the network :-)


On Thu, Feb 21, 2013 at 1:24 AM, raulgrande83 raulgrand...@hotmail.com wrote:

 Hi Amit,

 I have came across some JIRAs that may be useful in this issue:
 https://issues.apache.org/jira/browse/SOLR-4471
 https://issues.apache.org/jira/browse/SOLR-4354
 https://issues.apache.org/jira/browse/SOLR-4303
 https://issues.apache.org/jira/browse/SOLR-4413
 https://issues.apache.org/jira/browse/SOLR-2326

 Please, let us know if you find any solution.

 Regards.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Slaves-always-replicate-entire-index-Index-versions-tp4041256p4041817.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Slaves always replicate entire index Index versions

2013-02-21 Thread Amit Nithian
Sounds good. I am trying the combination of my patch and SOLR-4413 now to see
how it works, and will have to see if I can put unit tests around them, as
some of what I thought may not be true with respect to the commit generation
numbers.

For your issue above in your last post, is it possible that there was a
commit on the master in that slight window after solr checks for the latest
generation of the master but before it downloads the actual files? How
frequent are the commits on your master?


On Thu, Feb 21, 2013 at 2:00 AM, raulgrande83 raulgrand...@hotmail.com wrote:

 Thanks for the patch, we'll try to install these fixes and post if
 replication works or not.

 I renamed 'index.timestamp' folders to just 'index' but it didn't work.
 These lines appeared in the log:
 INFO: Master's generation: 64594
 21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchLatestIndex
 INFO: Slave's generation: 64593
 21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchLatestIndex
 INFO: Starting replication process
 21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchFileList
 SEVERE: No files to download for index generation: 64594



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Slaves-always-replicate-entire-index-Index-versions-tp4041256p4041827.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Anyone else see this error when running unit tests?

2013-02-14 Thread Amit Nithian
Okay, so I think I found a solution: if you are a Maven user and don't
mind forcing the test codec to Lucene40, then do the following.

Add this to your pom.xml under the <build><pluginManagement><plugins> section:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <version>2.13</version>
    <configuration>
      <argLine>-Dtests.codec=Lucene40</argLine>
    </configuration>
  </plugin>


If you are running in Eclipse, simply add this as a VM argument. The
default test codec is set to random and this means that there is a
possibility of picking Lucene3x if some random variable is < 2 and other
conditions are met. For me, my test-framework jar must not be ahead of
the lucene one (b/c I don't control the classpath order and honestly this
shouldn't be a requirement to run a test) so it periodically bombed. This
little fix seems to have helped provided that you don't care about Lucene3x
vs Lucene40 for your tests (I am on Lucene40 so it's fine for me).
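
As a quick sanity check (just a sketch assuming the Lucene 4.x test-framework
is on the classpath; the class and method names are made up), you can print
which codec the framework actually chose from inside any LuceneTestCase-based
test:

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.util.LuceneTestCase;

    public class CodecSanityTest extends LuceneTestCase {
      // With -Dtests.codec=Lucene40 this should report Lucene40; without it
      // the framework may pick a randomized codec (including Lucene3x).
      public void testWhichCodecIsActive() {
        System.out.println("Codec in use: " + Codec.getDefault().getName());
      }
    }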

HTH!

Amit


On Mon, Feb 4, 2013 at 6:18 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Me too, it fails randomly with test classes. We use Solr4.0 for testing, no
 maven, only ant.
 --roman
 On 4 Feb 2013 20:48, Mike Schultz mike.schu...@gmail.com wrote:

  Yes.  Just today actually.  I had some unit test based on
  AbstractSolrTestCase which worked in 4.0 but in 4.1 they would fail
  intermittently with that error message.  The key to this behavior is
 found
  by looking at the code in the lucene class:
  TestRuleSetupAndRestoreClassEnv.
  I don't understand it completely but there are a number of random code
  paths
  through there.  The following helped me get around the problem, at least
 in
  the short term.
 
 
 
 @org.apache.lucene.util.LuceneTestCase.SuppressCodecs({"Lucene3x","Lucene40"})
  public class CoreLevelTest extends AbstractSolrTestCase {
 
  I also need to call this inside my setUp() method, in 4.0 this wasn't
  required.
  initCore(solrconfig.xml, schema.xml, /tmp/my-solr-home);
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Anyone-else-see-this-error-when-running-unit-tests-tp4015034p4038472.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: replication problems with solr4.1

2013-02-14 Thread Amit Nithian
I may be missing something but let me go back to your original statements:
1) You build the index once per week from scratch
2) You replicate this from master to slave.

My understanding of the way replication works is that it's meant to only
send along files that are new; if any files with the same name on the
master and slave have different sizes, then this is treated as a corruption
of sorts, so it creates an index.<timestamp> directory and sends the full
thing down. This, I think, explains your index.<timestamp> issue, although
why the old index/ directory isn't being deleted I'm not sure about. This is
why I was asking about OS details, file system details, etc. (perhaps
something else is locking that directory preventing Java from deleting it?)

The second issue is the index generation which is governed by commits and
is represented by looking at the last few characters in the segments_XX
file. When the slave downloads the index and does the copy of the new
files, it does a commit to force a new searcher hence why the slave
generation will be +1 from the master.

The index version is a timestamp and it may be the case that the version
represents the point in time when the index was downloaded to the slave? In
general, it shouldn't matter about these details because replication is
only triggered if the master's version > slave's version and the clocks
that all servers use are synched to some common clock.
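
As an aside, if you want to see what each node thinks its version and
generation are, you can (if I remember right) ask the stock replication
handler directly; host names here are just illustrative:

    http://master-host:8983/solr/replication?command=indexversion
    http://slave-host:8983/solr/replication?command=details

indexversion reports the latest replicable version/generation, and details
dumps the fuller master/slave state.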

Caveat however in my answer is that I have yet to try 4.1 as this is next
on my TODO list so maybe I'll run into the same problem :-) but I wanted to
provide some info as I just recently dug through the replication code to
understand it better myself.

Cheers
Amit


On Wed, Feb 13, 2013 at 11:57 PM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 OK then index generation and index version are out of count when it comes
 to verify that master and slave index are in sync.

 What else is possible?

 The strange thing is if master is 2 or more generations ahead of slave
 then it works!
 With your logic the slave must _always_ be one generation ahead of the
 master,
 because the slave replicates from master and then does an additional commit
 to recognize the changes on the slave.
 This implies that the slave acts as follows:
 - if the master is one generation ahaed then do an additional commit
 - if the master is 2 or more generations ahead then do _no_ commit
 OR
 - if the master is 2 or more generations ahead then do a commit but don't
   change generation and version of index

 Can this be true?

 I would say not really.

 Regards
 Bernd


 Am 13.02.2013 20:38, schrieb Amit Nithian:
  Okay so then that should explain the generation difference of 1 between
 the
  master and slave
 
 
  On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote:
 
  doesn't it do a commit to force solr to recognize the changes?
 
  yes.
 
  - Mark
 
 



Re: Boost Specific Phrase

2013-02-13 Thread Amit Nithian
Have you looked at the pf parameter for dismax handlers? pf does I think
what you are looking for which is to boost documents with the query term
exactly matching in the various fields with some phrase slop.
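
Something along these lines is what I mean (field names and boosts are made
up for illustration):

    q=project manager in India with 2 yrs experience&defType=edismax&qf=title^2 description&pf=title^10 description^4&ps=1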


On Wed, Feb 13, 2013 at 2:59 AM, Hemant Verma hemantverm...@gmail.com wrote:

 Hi All

 I have a use case with phrase search.

 Let say I have a list of phrases in a file/dictionaries which are important
 as per our search content.
 One entry in the dictionary is lets say - project manager.
 If user's query contains any entry specified in dictionary then I want to
 boost the score of documents which have exact match of that entry.

 Lets take one example:-

 Now suppose user searches for (project manager in India with 2 yrs
 experience).
 There are words 'project manager' in the query in exact order as specified
 in dictionary then I want to boost the score of documents having 'project
 manager' as an exact match.

 This can be done at web application level after processing user query with
 dictionary and create query as below:
 q=project manager in India with 2 yrs experience&qf=title&bq=title:"project manager"^5

 I want to know is there any better solution available to this use case at
 Solr level.

 AFAIK there is something very similar available in FAST ESP know as Phrase
 Recognition.

 Thanks
 Hemant



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Boost-Specific-Phrase-tp4040188.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: what do you use for testing relevance?

2013-02-13 Thread Amit Nithian
Ultimately this is dependent on what your metrics for success are. For some
places it may be just raw CTR (did my click through rate increase) but for
other places it may be a function of money (either it may be gross revenue,
profits, # items sold etc). I don't know if there is a generic answer for
this question which is leading those to write their own frameworks b/c it's
very specific to your needs. A scoring change that leads to an increase in
CTR may not necessarily lead to an increase in the metric that makes your
business go.


On Tue, Feb 12, 2013 at 10:31 PM, Steffen Elberg Godskesen 
steffen.godske...@gmail.com wrote:


 Hi Roman,

 If you're looking for regression testing then
 https://github.com/sul-dlss/rspec-solr might be worth looking at. If
 you're not a ruby shop, doing something similar in another language
 shouldn't be to hard.


 The basic idea is that you setup a set of tests like

 If the query is X, then the document with id Y should be in the first 10
 results
 If the query is S, then a document with title T should be the first
 result
 If the query is P, then a document with author Q should not be in the
 first 10 result

 and that you run these whenever you tune your scoring formula to ensure
 that you haven't introduced unintended effects. New ideas/requirements for
 your relevance ranking should always result in writing new tests - that
 will probably fail until you tune your scoring formula. This is certainly
 no magic bullet, but it will give you some confidence that you didn't make
 things worse. And - in my humble opinion - it also gives you the benefit of
 discouraging you from tuning your scoring just for fun. To put it bluntly:
 if you cannot write up a requirement in form of a test, you probably have
 no need to tune your scoring.


 Regards,

 --
 Steffen



 On Tuesday, February 12, 2013 at 23:03 , Roman Chyla wrote:

  Hi,
  I do realize this is a very broad question, but still I need to ask it.
  Suppose you make a change into the scoring formula. How do you
  test/know/see what impact it had? Any framework out there?
 
  It seems like people are writing their own tools to measure relevancy.
 
  Thanks for any pointers,
 
  roman





Re: replication problems with solr4.1

2013-02-13 Thread Amit Nithian
So just a hunch... but when the slave downloads the data from the master,
doesn't it do a commit to force solr to recognize the changes? In so doing,
wouldn't that increase the generation number? In theory it shouldn't matter
because the replication looks for files that are different to determine
whether or not to do a full download or a partial replication. In the event
of a full replication (an optimize would cause this), I think the
replication handler considers this a corruption and forces a full
download into this index.<timestamp> folder, with the index.properties
pointing at this folder to tell solr this is the new index directory. Since
you mentioned you rebuild the index from scratch once per week I'd expect
to see this behavior you are mentioning.

I remember debugging the code to find out how replication works in 4.0
because of a bug that was fixed in 4.1 but I haven't read through the 4.1
code to see how much (if any) has changed from this logic.

In short, I don't know why you'd have the old index/ directory there..
that seems either like a bug or something was locking that directory in the
filesystem preventing it from being removed. What OS are you using and is
the index/ directory stored on a local file system vs NFS?

HTH
Amit


On Tue, Feb 12, 2013 at 2:26 AM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:


 Now this is strange, the index generation and index version
 is changing with replication.

 e.g. master has index generation 118 index version 136059533234
 and  slave  has index generation 118 index version 136059533234
 are both same.

 Now add one doc to master with commit.
 master has index generation 119 index version 1360595446556

 Next replicate master to slave. The result is:
 master has index generation 119 index version 1360595446556
 slave  has index generation 120 index version 1360595564333

 I have not seen this before.
 I thought replication is just taking over the index from master to slave,
 more like a sync?




 Am 11.02.2013 09:29, schrieb Bernd Fehling:
  Hi list,
 
  after upgrading from solr4.0 to solr4.1 and running it for two weeks now
  it turns out that replication has problems and unpredictable results.
  My installation is single index 41 mio. docs / 115 GB index size / 1
 master / 3 slaves.
  - the master builds a new index from scratch once a week
  - a replication is started manually with Solr admin GUI
 
  What I see is one of these cases:
  - after a replication a new searcher is opened on index.xxx
 directory and
the old data/index/ directory is never deleted and besides the file
replication.properties there is also a file index.properties
  OR
  - the replication takes place everything looks fine but when opening the
 admin GUI
the statistics report
  Last Modified: a day ago
  Num Docs: 42262349
  Max Doc:  42262349
  Deleted Docs:  0
  Version:  45174
  Segment Count: 1
 
             Version         Gen   Size
     Master: 1360483635404   112   116.5 GB
     Slave:  1360483806741   113   116.5 GB
 
 
  In the first case, why is the replication doing that???
  It is an offline slave, no search activity, just there fore backup!
 
 
  In the second case, why is the version and generation different right
 after
  full replication?
 
 
  Any thoughts on this?
 
 
  - Bernd
 

 --
 *
 Bernd FehlingBielefeld University Library
 Dipl.-Inform. (FH)LibTec - Library Technology
 Universitätsstr. 25  and Knowledge Management
 33615 Bielefeld
 Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de

 BASE - Bielefeld Academic Search Engine - www.base-search.net
 *



Re: replication problems with solr4.1

2013-02-13 Thread Amit Nithian
Okay so then that should explain the generation difference of 1 between the
master and slave


On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com wrote:


 On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote:

  doesn't it do a commit to force solr to recognize the changes?

 yes.

 - Mark



Re: Boost Specific Phrase

2013-02-13 Thread Amit Nithian
Ah yes, sorry, I misunderstood. Another option is to use n-grams so that
"projectmanager" is a term; that way any query involving "project manager in
India with 2 years experience" would match higher because the query would
contain "projectmanager" as a term.
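
One way to sketch that in the schema is a shingle filter (untested, the
analyzer chain is purely illustrative); with tokenSeparator="" the adjacent
pair "project manager" gets indexed as the single term "projectmanager":

    <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
      </analyzer>
    </fieldType>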


On Wed, Feb 13, 2013 at 9:56 PM, Hemant Verma hemantverm...@gmail.com wrote:

 Thanks for the response.

 pf parameter actually boost the documents considering all search keywords
 mentioned in main query but I am looking for something which boost the
 documents considering few search keywords from the user query.
 Like as per the example, user query is (project manager in India with 2 yrs
 experience) and my dictionary contains one entry as 'project manager' which
 specifies if user query has 'project manager' in his query then boost those
 documents which contains 'project manager' as an exact match.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Boost-Specific-Phrase-tp4040188p4040371.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr HTTP Replication Question

2013-01-25 Thread Amit Nithian
Okay one last note... just for closure... looks like it was addressed in
solr 4.1+ (I was looking at 4.0).


On Thu, Jan 24, 2013 at 11:14 PM, Amit Nithian anith...@gmail.com wrote:

 Okay so after some debugging I found the problem. The replication
 piece will download the index from the master server and move the files to
 the index directory, but during the commit phase these older generation
 files are deleted and the index is essentially left intact.

 I noticed that a full copy is needed if the index is stale (meaning that
 files in common between the master and slave have different sizes), but I
 think a full copy should also be needed if the slave's generation is higher
 than the master's. In short, to me it's not sufficient to simply say a full
 copy is needed if the slave's index version is >= the master's index
 version. I'll create a patch and file a bug along with a more thorough
 writeup of how I got in this state.

 Thanks!
 Amit



 On Thu, Jan 24, 2013 at 2:33 PM, Amit Nithian anith...@gmail.com wrote:

 Does Solr's replication look at the generation difference between master
 and slave when determining whether or not to replicate?

 To be more clear:
 What happens if a slave's generation is higher than the master yet the
 slave's index version is less than the master's index version?

 I looked at the source and didn't seem to see any reason why the
 generation matters other than fetching the file list from the master for a
 given generation. It's too wordy to explain how this happened so I'll go
 into details on that if anyone cares.

 Thanks!
 Amit





Re: Solr HTTP Replication Question

2013-01-24 Thread Amit Nithian
Okay so after some debugging I found the problem. The replication
piece will download the index from the master server and move the files to
the index directory, but during the commit phase these older generation
files are deleted and the index is essentially left intact.

I noticed that a full copy is needed if the index is stale (meaning that
files in common between the master and slave have different sizes), but I
think a full copy should also be needed if the slave's generation is higher
than the master's. In short, to me it's not sufficient to simply say a full
copy is needed if the slave's index version is >= the master's index
version. I'll create a patch and file a bug along with a more thorough
writeup of how I got in this state.
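
Roughly, the condition I'm arguing for looks like this (just an illustration
of the logic with made-up names, not the actual SnapPuller code):

    static boolean isFullCopyNeeded(boolean filesDifferInSize,
                                    long masterVersion, long slaveVersion,
                                    long masterGeneration, long slaveGeneration) {
      // stale files always force a full copy; so should a slave that has
      // somehow gotten ahead of the master in version or generation
      return filesDifferInSize
          || slaveVersion >= masterVersion
          || slaveGeneration > masterGeneration;
    }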

Thanks!
Amit



On Thu, Jan 24, 2013 at 2:33 PM, Amit Nithian anith...@gmail.com wrote:

 Does Solr's replication look at the generation difference between master
 and slave when determining whether or not to replicate?

 To be more clear:
 What happens if a slave's generation is higher than the master yet the
 slave's index version is less than the master's index version?

 I looked at the source and didn't seem to see any reason why the
 generation matters other than fetching the file list from the master for a
 given generation. It's too wordy to explain how this happened so I'll go
 into details on that if anyone cares.

 Thanks!
 Amit



Re: group.ngroups behavior in response

2013-01-17 Thread Amit Nithian
A new response attribute would be better but it also complicates the patch
in that it would require a new way to serialize DocSlices I think
(especially when group.main=true)? I was looking to set group.main=true so
that my existing clients don't have to change to parse the grouped
resultset format.

Secondly, while a new response attribute makes sense the question is
whether or not numFound is the numGroups or numTotal. To me it should be
the number of groups because logically that is what the resultset shows and
the new attribute should point to the number of total.

Thanks
Amit


group.ngroups behavior in response

2013-01-16 Thread Amit Nithian
Hi all,

I recently discovered the group.main=true/false parameter which really has
made life simple in terms of ensuring that the format coming out of Solr
for my clients (RoR app) is backwards compatible with the non-grouped
results, which ensures no special "handle grouped results" logic.

The only issue though is that the numFound is the number of total matches
instead of the number of groups which can seem odd (and incorrect if you
rely on the numFound to determine whether or not to display a next page
link).

I created a JIRA issue, SOLR-4310, and submitted a patch for this and
wanted to get feedback to see if this is an issue that others have
encountered and if so, would this help.

Thanks
Amit


Re: Grouping by a date field

2012-11-29 Thread Amit Nithian
Why not create a new field that just contains the day component? Then you
can group by this field.
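
For example (field name and type are just for illustration), you could index
a day-granularity copy of the timestamp, e.g. "2012-11-29", and group on it:

    <field name="datetime_day" type="string" indexed="true" stored="true"/>

    q=*:*&group=true&group.field=datetime_day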


On Thu, Nov 29, 2012 at 12:38 PM, sdanzig sdan...@gmail.com wrote:

 I'm trying to create a SOLR query that groups/field collapses by date.  I
 have a field in -MM-dd'T'HH:mm:ss'Z' format, datetime, and I'm
 looking
 to group by just per day.  When grouping on this field using
 group.field=datetime in the query, SOLR responds with a group for every
 second.  I'm able to easily use this field to create day-based facets, but
 not groups.  Advice please?

 - Scott



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Grouping-by-a-date-field-tp4023318.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Grouping by a date field

2012-11-29 Thread Amit Nithian
What's the performance impact of doing this?


On Thu, Nov 29, 2012 at 7:54 PM, Jack Krupansky j...@basetechnology.com wrote:

 Or group by a function query which is the date field converted to
 milliseconds divided by the number of milliseconds in a day.

 Such as:

   q=*:*&group=true&group.func=rint(div(ms(date_dt),mul(24,mul(60,mul(60,1000)))))

 -- Jack Krupansky

 -Original Message- From: Amit Nithian
 Sent: Thursday, November 29, 2012 10:29 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Grouping by a date field


 Why not create a new field that just contains the day component? Then you
 can group by this field.


 On Thu, Nov 29, 2012 at 12:38 PM, sdanzig sdan...@gmail.com wrote:

  I'm trying to create a SOLR query that groups/field collapses by date.  I
 have a field in -MM-dd'T'HH:mm:ss'Z' format, datetime, and I'm
 looking
 to group by just per day.  When grouping on this field using
 group.field=datetime in the query, SOLR responds with a group for every
 second.  I'm able to easily use this field to create day-based facets, but
 not groups.  Advice please?

 - Scott



 --
 View this message in context:
  http://lucene.472066.n3.nabble.com/Grouping-by-a-date-field-tp4023318.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: is there a way to prevent abusing rows parameter

2012-11-26 Thread Amit Nithian
If you're going to validate the rows parameter, may as well validate the
start parameter too... I've run into problems where start and rows with
ridiculously high values crash our servers.


On Thu, Nov 22, 2012 at 9:58 AM, solr-user solr-u...@hotmail.com wrote:

 Thanks guys.  This is a problem with the front end not validating requests.
 I was hoping there might be a simple config value I could enter/change,
 rather than going the long process of migrating a proper fix all the way up
 to our production servers.  Looks like not, but thx.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/is-there-a-way-to-prevent-abusing-rows-parameter-tp4021467p4021892.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Search among multiple cores

2012-11-26 Thread Amit Nithian
You can simplify your code by searching across cores in the SearchComponent:
1) public class YourComponent implements SolrCoreAware
-- Grab instance of CoreContainer and store (mCoreContainer =
core.getCoreDescriptor().getCoreContainer();)
2) In the process method:
* grab the core requested (SolrCore core
= mCoreContainer.getCore(sCoreName);)

This way you can avoid having to implement the listener you mentioned and
passing this in the servlet config.
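
Something like this is what I had in mind, as a rough sketch against the 4.x
APIs (core and class names are illustrative, error handling omitted):

    import java.io.IOException;

    import org.apache.solr.core.CoreContainer;
    import org.apache.solr.core.SolrCore;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.util.RefCounted;
    import org.apache.solr.util.plugin.SolrCoreAware;

    public class CrossCoreLookupComponent extends SearchComponent implements SolrCoreAware {

      private CoreContainer coreContainer;

      @Override
      public void inform(SolrCore core) {
        // hang on to the container so we can reach sibling cores at query time
        coreContainer = core.getCoreDescriptor().getCoreContainer();
      }

      @Override
      public void prepare(ResponseBuilder rb) throws IOException {
      }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
        SolrCore coreB = coreContainer.getCore("coreB"); // bumps the core's reference count
        try {
          RefCounted<SolrIndexSearcher> ref = coreB.getSearcher();
          try {
            SolrIndexSearcher searcher = ref.get();
            // ... run whatever lookup you need against coreB's searcher here ...
          } finally {
            ref.decref();
          }
        } finally {
          coreB.close(); // releases the reference taken by getCore()
        }
      }

      @Override
      public String getDescription() {
        return "cross-core lookup sketch";
      }

      @Override
      public String getSource() {
        return null;
      }
    }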


On Mon, Nov 26, 2012 at 7:28 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Would http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer save you some
 work?

 Otis
 --
 SOLR Performance Monitoring - http://sematext.com/spm/index.html
 Search Analytics - http://sematext.com/search-analytics/index.html




 On Mon, Nov 26, 2012 at 7:18 PM, Nicholas Ding nicholas...@gmail.com
 wrote:

  Hi,
 
  I'm working on a search engine project based on Solr. Now I have three
  cores (Core A, B, C). I need to search Core A and Core B to get required
  parameters to search Core C. So far, I wrote a SearchComponent which uses
  SolrJ inside because I can't access other cores directly in
  SearchComponent. I was bit worried about performance and scalability
  because SolrJ brings little HTTP overhead.
 
  After digging into the Solr's source code, I wrote a SolrContextListener
 to
  initialize CoreContainer at server startup then put it into a
  ServlerContext. Then I wrote another Servlet to get a reference from
  ServletContext and now I'm able to get all the Core references in Java.
 
  The good part is I can access all Solr's internal structure in Java, but
  the bad part is I have to deal with internal types which requires deep
  understanding of Solr's source code.
 
  I was wondering if anybody had done similar things before? What's the
 side
  effects of extending Solr in code level?
 
  Thanks
  Nicholas
 



Re: custom request handler

2012-11-11 Thread Amit Nithian
Hi Lee,

So the query component would be a subclass of SearchComponent and you can
define the list of components executed during a search handler.
http://wiki.apache.org/solr/SearchComponent

I *think* you can have a custom component do what you want as long as it's
the first component in the list so you can inspect and re-set the
parameters before it goes downstream to the other components. However, it's
still not clear how you are going to prevent users from POSTing bad queries
or looking at things they probably shouldn't be like the schema.xml or
solrconfig.xml or the admin console. Maybe there are ways in Solr to
prevent this but then you'd have to allow it for internal admins but
exclude it for the public.

If you are exposing your slaves to the actual world wide public then I'd
strongly suggest an app layer between solr and the public. I treat Solr
like my database meaning that I don't expose access to my database publicly
but rather through some app layer (say some CMS tools or what not).

HTH!
Amit


On Sun, Nov 11, 2012 at 5:23 AM, Lee Carroll
lee.a.carr...@googlemail.comwrote:

 Only slaves are public facing and they are read only, with limited query
 request handlers defined. The above approach is to prevent abusive / in
 appropriate queries by clients. A query component sounds interesting would
 this be implemented through an interface so could be separate from solr or
 would it be sub classing a base component ?

 cheers lee c


 On 9 November 2012 17:24, Amit Nithian anith...@gmail.com wrote:

  Lee,
 
  I guess my question was if you are trying to prevent the big bad world
  from doing stuff they aren't supposed to in Solr, how are you going to
  prevent the big bad world from POSTing a delete all query? Or restrict
  them from hitting the admin console, looking at the schema.xml,
  solrconfig.xml.
 
  I guess the question here is who is the big bad world? The internet at
  large or employees/colleagues in your organization? If it's the internet
 at
  large then I'd totally decouple this from Solr b/c I want to be 100% sure
  that the *only* thing that the internet has access to is a GET on /select
  with some restrictions and this could be done in many places but it's not
  clear that coupling this to Solr is the place to do it.
 
  If the big bad world is just within your organization and you want some
  basic protections around what they can and can't see then what you are
  saying is reasonable to me. Also perhaps another option is to consider a
  query component rather than creating a subclass of the request handler
 as a
  query component promotes more re-use and flexibility. You could make the
  necessary parameter changes in the prepare() method and just make sure
 that
  this safe parameter component comes before the query component in the
  list of components for a handler and you should be fine.
 
  Cheers!
  Amit
 
 
  On Fri, Nov 9, 2012 at 5:39 AM, Lee Carroll 
 lee.a.carr...@googlemail.com
  wrote:
 
   Hi Amit
  
   I did not do this via a servlet filter as I wanted the solr devs to be
   concerned with solr config and keep them out of any concerns of the
   container. By specifying declarative data in a request handler that
 would
   be enough to produce a service uri for an application.
  
   Or have  I missed a point ? We have several cores with several apps all
   with different data query needs. Maybe 20 request handlers needed to
   support this with active development on going. Basically I want it easy
  for
   devs to create a specific request handler suited to their needs. I
  thought
   a servletfilter developed and mainatined every time would be over kill.
   Again though I may have missed a point / over emphasised a difficulty?
  
   Are you saying my custom request handler is to tightly bound to solr?
 so
   the parameters my apps talk is not de-coupled enough from solr?
  
   Lee C
  
   On 7 November 2012 19:49, Amit Nithian anith...@gmail.com wrote:
  
Why not do this in a ServletFilter? Alternatively, I'd just write a
  front
end application servlet to do this so that you don't firewall your
   internal
admins off from accessing the core Solr admin pages. I guess you
 could
solve this using some form of security but I don't know this well
  enough.
   
If I were to restrict access to certain parts of Solr, I'd do this
   outside
of Solr itself and do this in a servlet or a filter, inspecting the
parameters. It's easy to create a modifiable parameters class and
populate that with acceptable parameters before the Solr filter
  operates
   on
it.
   
HTH
Amit
   
   
   
   
  
 



Re: Preventing accepting queries while custom QueryComponent starts up?

2012-11-11 Thread Amit Nithian
Jack,

I think the issue is that the ping which is used to determine whether or
not the server is live returns a seemingly false positive back to the load
balancer (and indirectly the client) that this server is ready to go when
in fact it's not. Reading this page (
http://wiki.apache.org/solr/SolrConfigXml), it does seem to be documented
to do this but it may not be fully stressed to hide your Solr behind a load
balancer.  I am more than happy to write up a post that, in my opinion at
least, stresses some best practices on the use of Solr based on my
experience if others find this useful.

What seems odd here is that the ping is a query so maybe the ping query in
the solrconfig (for Aaron and others having this) should be configured to
hit the handler that is used by the front end app so that while that
handler is warming up the ping query will be blocked.
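
Something along these lines is what I mean (a sketch only; the handler name
and query are illustrative, and healthcheckFile is what makes the
enable/disable actions work):

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <lst name="invariants">
        <!-- route the ping query through the same handler the front end uses -->
        <str name="qt">/frontend</str>
        <str name="q">some representative query</str>
      </lst>
      <str name="healthcheckFile">server-enabled.txt</str>
    </requestHandler>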

Of course using the load balancer means that the app layer knows nothing
about servers in and out of rotation.

Cheers!
Amit


On Sun, Nov 11, 2012 at 8:05 AM, Jack Krupansky j...@basetechnology.com wrote:

 Is the issue here that the Solr node is continuously live with the load
 balancer so that the moment during startup that Solr can respond to
 anything, the load balancer will be sending it traffic and that this can
 occur while Solr is still warming up?

 First, shouldn't we be encouraging people to have an app layer between
 Solr and the outside world? If so, the app layer should simply not respond
 to traffic until the app layer can verified that Solr has stabilized. If
 not, then maybe we do need to suggest a change to Solr so that the
 developer can control exactly when Solr becomes live and responsive to
 incoming traffic.

 At a minimum, we should document when that moment is today in terms of an
 explicit contract. It sounds like the problem is that the contract is
 either nonexistent, vague, ambiguous, non-deterministic, or whatever.

 -- Jack Krupansky

 -Original Message- From: Amit Nithian
 Sent: Saturday, November 10, 2012 4:24 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Preventing accepting queries while custom QueryComponent
 starts up?


 Yeah that's what I was suggesting in my response too. I don't think your
 load balancer should be doing this but whatever script does the release
 (restarting the container) should do this so that when the ping is enabled
 the warming has finished.


On Sat, Nov 10, 2012 at 3:33 PM, Erick Erickson erickerick...@gmail.com wrote:

  Hmmm, rather than hit the ping query, why not just send in a real query
 and
 only let the queued ones through after the response?

 Just a random thought
 Erick


 On Sat, Nov 10, 2012 at 2:53 PM, Amit Nithian anith...@gmail.com wrote:

  Yes but the problem is that if user facing queries are hitting a server
  that is warming up and isn't being serviced quickly, then you could
  potentially bring down your site if all the front end threads are 
 blocked
  on Solr queries b/c those queries are waiting (presumably at the
 container
  level since the filter hasn't finished its init() sequence) for the
 warming
  to complete (this is especially notorious when your front end is rails).
  This is why your ping to enable/disable a server from the load balancer
 has
  to be accurate with regards to whether or not a server is truly ready 
 and
  warm.
 
  I think what I am gathering from this discussion is that the server is
  warming up, the ping is going through and tells the load balancer this
  server is ready, user queries are hitting this server and are queued
  waiting for the firstSearcher to finish (say these initial user queries
 are
  to respond in 500-1000ms) that's terrible for performance.
 
  Alternatively, if you have a bunch of servers behind a load balancer, 
 you
  want this one server (or block of servers depending on your deployment)
 to
  be reasonably sure that user queries will return in a decent time
 (whatever
  you define decent to be) hence why this matters.
 
  Let me know if I am missing anything.
 
  Thanks
  Amit
 
 
  On Sat, Nov 10, 2012 at 10:03 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
   Why does it matter? The whole idea of firstSearcher queries is to warm
 up
   your system as fast as possible. The theory is that upon restarting 
  the
   server, let's bet this stuff going immediately... They were never
  intended
   (as far as I know) to complete before any queries were handled. As an
   aside, I'm not quite sure I understand why pings during the warmup are
 a
   problem.
  
   But anyway. firstSearcher is particularly relevant because the
   autowarmCount settings on your caches are irrelevant when starting the
   server, there's no history to autowarm
  
   But, there's no good reason _not_ to let queries through while
   firstSearcher is doing it's tricks, they just get into the queue and
 are
   served as quickly as they may. That might be some time since, as you
 say,
   they may not get serviced

Re: 4.0 query question

2012-11-11 Thread Amit Nithian
Why not group by cid using the grouping component, within the group sort by
version descending and return 1 result per group.

http://wiki.apache.org/solr/FieldCollapsing
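
As a sketch (the field names are assumptions based on the thread, so adjust
to your schema):

    q=*:*&group=true&group.field=cid&group.sort=file_version desc&group.limit=1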

Cheers
Amit


On Fri, Nov 9, 2012 at 2:56 PM, dm_tim dm_...@yahoo.com wrote:

 I think I may have found my answer buy I'd like additional validation:
 I believe that I can add a function to my query to get only the highest
 values of 'file_version' like this -
 _val_:max(file_version, 1)

 I seem to be getting the results I want. Does this look correct?

 Regards,

 Tim



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/4-0-query-question-tp4019397p4019426.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Preventing accepting queries while custom QueryComponent starts up?

2012-11-10 Thread Amit Nithian
Yes but the problem is that if user facing queries are hitting a server
that is warming up and isn't being serviced quickly, then you could
potentially bring down your site if all the front end threads are blocked
on Solr queries b/c those queries are waiting (presumably at the container
level since the filter hasn't finished its init() sequence) for the warming
to complete (this is especially notorious when your front end is rails).
This is why your ping to enable/disable a server from the load balancer has
to be accurate with regards to whether or not a server is truly ready and
warm.

I think what I am gathering from this discussion is that the server is
warming up, the ping is going through and tells the load balancer this
server is ready, user queries are hitting this server and are queued
waiting for the firstSearcher to finish (say these initial user queries are
to respond in 500-1000ms) that's terrible for performance.

Alternatively, if you have a bunch of servers behind a load balancer, you
want this one server (or block of servers depending on your deployment) to
be reasonably sure that user queries will return in a decent time (whatever
you define decent to be) hence why this matters.

Let me know if I am missing anything.

Thanks
Amit


On Sat, Nov 10, 2012 at 10:03 AM, Erick Erickson erickerick...@gmail.com wrote:

 Why does it matter? The whole idea of firstSearcher queries is to warm up
 your system as fast as possible. The theory is that upon restarting the
 server, let's bet this stuff going immediately... They were never intended
 (as far as I know) to complete before any queries were handled. As an
 aside, I'm not quite sure I understand why pings during the warmup are a
 problem.

 But anyway. firstSearcher is particularly relevant because the
 autowarmCount settings on your caches are irrelevant when starting the
 server, there's no history to autowarm

 But, there's no good reason _not_ to let queries through while
 firstSearcher is doing it's tricks, they just get into the queue and are
 served as quickly as they may. That might be some time since, as you say,
 they may not get serviced until the expensive parts get filled. But I don't
 think having them be serviced is doing any harm.

 Now, newSearcher and autowarming of the caches is a completely different
 beast since having the old searchers continue serving requests until the
 warmups _does_ directly impact the user, they don't see random slowness
 because a searcher is being opened.

 So I guess my real question is whether you're seeing a measurable problem
 or if this is a red herring

 FWIW,
 Erick


 On Thu, Nov 8, 2012 at 2:54 PM, Aaron Daubman daub...@gmail.com wrote:

  Greetings,
 
  I have several custom QueryComponents that have high one-time startup
 costs
  (hashing things in the index, caching things from a RDBMS, etc...)
 
  Is there a way to prevent solr from accepting connections before all
  QueryComponents are ready?
 
  Especially, since many of our instance are load-balanced (and
  added-in/removed automatically based on admin/ping responses) preventing
  ping from answering prior to all custom QueryComponents being ready would
  be ideal...
 
  Thanks,
   Aaron
 



Re: My latest solr blog post on Solr's PostFiltering

2012-11-09 Thread Amit Nithian
Oh weird. I'll post URLs on their own lines next time to clarify.

Thanks guys and looking forward to any feedback!

Cheers
Amit


On Fri, Nov 9, 2012 at 2:05 AM, Dmitry Kan dmitry@gmail.com wrote:

 I guess the url should have been:


 http://hokiesuns.blogspot.com/2012/11/using-solrs-postfiltering-to-collect.html

 i.e. without 'and' in the end of it.

 -- Dmitry

 On Fri, Nov 9, 2012 at 12:03 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  It's always good when someone writes up their experiences!
 
  But when I try to follow that link, I get to your Random Writings, but
 it
  tells me that the blog post doesn't exist...
 
  Erick
 
 
  On Thu, Nov 8, 2012 at 4:21 PM, Amit Nithian anith...@gmail.com wrote:
 
   Hey all,
  
   I wanted to thank those who have helped in answering some of my
 esoteric
   questions and especially the one about using Solr's post filtering
  feature
   to implement some score statistics gathering we had to do at Zvents.
  
   To show this appreciation and to help advance the knowledge of this
 space
   in a more codified fashion, I have written a blog post about this work
  and
   open sourced the work as well.
  
   Please take a read by visiting
  
  
 
  http://hokiesuns.blogspot.com/2012/11/using-solrs-postfiltering-to-collect.html and
   please let me know if there are any inaccuracies or points of
   contention so I can address/correct them.
  
   Thanks!
   Amit
  
 



 --
 Regards,

 Dmitry Kan



Re: custom request handler

2012-11-09 Thread Amit Nithian
Lee,

I guess my question was if you are trying to prevent the big bad world
from doing stuff they aren't supposed to in Solr, how are you going to
prevent the big bad world from POSTing a delete all query? Or restrict
them from hitting the admin console, looking at the schema.xml,
solrconfig.xml.

I guess the question here is who is the big bad world? The internet at
large or employees/colleagues in your organization? If it's the internet at
large then I'd totally decouple this from Solr b/c I want to be 100% sure
that the *only* thing that the internet has access to is a GET on /select
with some restrictions and this could be done in many places but it's not
clear that coupling this to Solr is the place to do it.

If the big bad world is just within your organization and you want some
basic protections around what they can and can't see then what you are
saying is reasonable to me. Also perhaps another option is to consider a
query component rather than creating a subclass of the request handler as a
query component promotes more re-use and flexibility. You could make the
necessary parameter changes in the prepare() method and just make sure that
this safe parameter component comes before the query component in the
list of components for a handler and you should be fine.
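
A rough sketch of what that safe parameter component could look like (4.x
APIs; the allow-list and row cap are purely illustrative):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    public class SafeParamsComponent extends SearchComponent {

      private static final Set<String> ALLOWED =
          new HashSet<String>(Arrays.asList("q", "fq", "sort", "start", "rows", "fl"));

      @Override
      public void prepare(ResponseBuilder rb) throws IOException {
        SolrParams original = rb.req.getParams();
        ModifiableSolrParams safe = new ModifiableSolrParams();
        Iterator<String> names = original.getParameterNamesIterator();
        while (names.hasNext()) {
          String name = names.next();
          if (ALLOWED.contains(name)) {
            safe.set(name, original.getParams(name));
          }
        }
        // clamp rows so nobody can ask for the whole index in one request
        safe.set("rows", Math.min(original.getInt("rows", 10), 100));
        rb.req.setParams(safe);
      }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
        // nothing to do at process time; QueryComponent runs after this
      }

      @Override
      public String getDescription() {
        return "parameter allow-list sketch";
      }

      @Override
      public String getSource() {
        return null;
      }
    }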

Cheers!
Amit


On Fri, Nov 9, 2012 at 5:39 AM, Lee Carroll lee.a.carr...@googlemail.com wrote:

 Hi Amit

 I did not do this via a servlet filter as I wanted the solr devs to be
 concerned with solr config and keep them out of any concerns of the
 container. By specifying declarative data in a request handler that would
 be enough to produce a service uri for an application.

 Or have  I missed a point ? We have several cores with several apps all
 with different data query needs. Maybe 20 request handlers needed to
 support this with active development on going. Basically I want it easy for
 devs to create a specific request handler suited to their needs. I thought
 a servletfilter developed and mainatined every time would be over kill.
 Again though I may have missed a point / over emphasised a difficulty?

 Are you saying my custom request handler is to tightly bound to solr? so
 the parameters my apps talk is not de-coupled enough from solr?

 Lee C

 On 7 November 2012 19:49, Amit Nithian anith...@gmail.com wrote:

  Why not do this in a ServletFilter? Alternatively, I'd just write a front
  end application servlet to do this so that you don't firewall your
 internal
  admins off from accessing the core Solr admin pages. I guess you could
  solve this using some form of security but I don't know this well enough.
 
  If I were to restrict access to certain parts of Solr, I'd do this
 outside
  of Solr itself and do this in a servlet or a filter, inspecting the
  parameters. It's easy to create a modifiable parameters class and
  populate that with acceptable parameters before the Solr filter operates
 on
  it.
 
  HTH
  Amit
 
 
 
 



Re: is it possible to save the search query?

2012-11-08 Thread Amit Nithian
Are you trying to do this in real time or offlline? Wouldn't mining your
access logs help? It may help to have your front end application pass in
some extra parameters that are not interpreted by Solr but are there for
stamping purposes for log analysis. One example could be a user id or
user cookie or something in case you have to construct sessions.
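
For example (the extra parameter names are made up; Solr should just ignore
parameters it doesn't recognize, but they still end up in the request log
where you can mine them later):

    http://localhost:8983/solr/db/select?defType=dismax&q=cashier2&qf=data&userId=12345&sessionId=abc123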


On Wed, Nov 7, 2012 at 10:01 PM, Romita Saha
romita.s...@sg.panasonic.com wrote:

 Hi,

 The following is the example;
 1st query:


 http://localhost:8983/solr/db/select/?defType=dismax&debugQuery=on&q=cashier2&qf=data^2 id&start=0&rows=11&fl=data,id

 Next query:


 http://localhost:8983/solr/db/select/?defType=dismax&debugQuery=on&q=cashier2&qf=data id^2&start=0&rows=11&fl=data,id

 In the 1st query the field 'data' is boosted by 2. However, maybe the
 user was not satisfied with the response. Thus in the next query he
 boosted the field 'id' by 2.

 I want to record both the queries and compare between the two, meaning,
 what are the changes implemented on the 2nd query which are not present in
 the previous one.

 Thanks and regards,
 Romita Saha



 From:   Otis Gospodnetic otis.gospodne...@gmail.com
 To: solr-user@lucene.apache.org,
 Date:   11/08/2012 01:35 PM
 Subject:Re: is it possible to save the search query?



 Hi,

 Compare in what sense?  An example will help.

 Otis
 --
 Performance Monitoring - http://sematext.com/spm
 On Nov 7, 2012 8:45 PM, Romita Saha romita.s...@sg.panasonic.com
 wrote:

  Hi All,
 
  Is it possible to record a search query in solr and then compare it with
  the previous search query?
 
  Thanks and regards,
  Romita Saha
 




Re: Searching for Partial Words

2012-11-08 Thread Amit Nithian
Look at the normal ngram tokenizer. "Engine" with ngram size 3 would yield
"eng" "ngi" "gin" "ine", so a search for "engi" should match. You can play
around with the min/max values. Edge ngram is useful for prefix matching
but it sounds like you want intra-word matching too? ("eng" should match
"ResidentEngineer")
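
A sketch of that kind of field type (gram sizes are illustrative; you'd
normally start with the same analysis on the index and query side and tune
from there):

    <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>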


On Tue, Nov 6, 2012 at 7:35 AM, Sohail Aboobaker sabooba...@gmail.com wrote:

 Thanks Jack.
 In the configuration below:

  <fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.EdgeNGramTokenizerFactory" side="front" minGramSize="1" maxGramSize="1"/>
    </analyzer>
  </fieldType>

 What are the possible values for side?

 If I understand it correctly, minGramSize=3 and side=front, will
 include eng* but not en*. Is this correct? So, the minGramSize is for
 number of characters allowed in the specified side.

 Does it allow side=both :) or something similar?

 Regards,
 Sohail



Re: Preventing accepting queries while custom QueryComponent starts up?

2012-11-08 Thread Amit Nithian
I think Solr does this by default and are you executing warming queries in
the firstSearcher so that these actions are done before Solr is ready to
accept real queries?


On Thu, Nov 8, 2012 at 11:54 AM, Aaron Daubman daub...@gmail.com wrote:

 Greetings,

 I have several custom QueryComponents that have high one-time startup costs
 (hashing things in the index, caching things from a RDBMS, etc...)

 Is there a way to prevent solr from accepting connections before all
 QueryComponents are ready?

 Especially, since many of our instance are load-balanced (and
 added-in/removed automatically based on admin/ping responses) preventing
 ping from answering prior to all custom QueryComponents being ready would
 be ideal...

 Thanks,
  Aaron



Re: Preventing accepting queries while custom QueryComponent starts up?

2012-11-08 Thread Amit Nithian
Sorry, I misunderstood. I am having difficulty finding this, but it's never
clear what the exact load order is. It seems odd that you'd be getting requests
when the filter (DispatchFilter) hasn't 100% loaded yet.

I didn't think that the admin handler would allow requests while the
dispatch filter is still init'ing but sounds like it is? I'll have to play
with this to see.. curious what the problem is for we have a similar setup
but not as bad of an init problem (plus when I deploy, my deploy script
runs some actual simple test queries to ensure they return before enabling
the ping handler to return 200s) to avoid this problem.

Cheers
Amit


On Thu, Nov 8, 2012 at 1:33 PM, Aaron Daubman daub...@gmail.com wrote:

 Amit,

 I am using warming /firstSearcher queries to ensure this happens before any
 external queries are received, however, unless I am misinterpreting the
 logs, solr starts responding to admin/ping requests before firstSearcher
 completes, and, the LB then puts the solr instance back in the pool, and it
 starts accepting connections...


 On Thu, Nov 8, 2012 at 4:24 PM, Amit Nithian anith...@gmail.com wrote:

  I think Solr does this by default and are you executing warming queries
 in
  the firstSearcher so that these actions are done before Solr is ready to
  accept real queries?
 
 
  On Thu, Nov 8, 2012 at 11:54 AM, Aaron Daubman daub...@gmail.com
 wrote:
 
   Greetings,
  
   I have several custom QueryComponents that have high one-time startup
  costs
   (hashing things in the index, caching things from a RDBMS, etc...)
  
   Is there a way to prevent solr from accepting connections before all
   QueryComponents are ready?
  
   Especially, since many of our instance are load-balanced (and
   added-in/removed automatically based on admin/ping responses)
 preventing
   ping from answering prior to all custom QueryComponents being ready
 would
   be ideal...
  
   Thanks,
Aaron
  
 



Re: Preventing accepting queries while custom QueryComponent starts up?

2012-11-08 Thread Amit Nithian
Hi Aaron,

Check out
http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/handler/PingRequestHandler.html
You'll see the ?action=enable/disable. I have our load balancers remove the
server out of rotation when the response code != 200 for some number of
times in a row which I suspect you are doing too. If I am rolling releasing
our search code to production, it gets disabled, sleep for some known
number of seconds for the LB to yank the search server out of rotation,
push the code, execute some queries using CURL to ensure a response (the
warming process should block the request until done) and then enable.
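
In script form it's roughly this (assuming the ping handler has a
healthcheckFile configured, per the javadoc above; host/port/paths are
whatever your install uses):

    curl "http://localhost:8983/solr/admin/ping?action=disable"
    # ... push the new code, restart, run a few real queries with curl to warm ...
    curl "http://localhost:8983/solr/admin/ping?action=enable"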

HTH!
Amit


On Thu, Nov 8, 2012 at 2:01 PM, Aaron Daubman daub...@gmail.com wrote:

   (plus when I deploy, my deploy script
  runs some actual simple test queries to ensure they return before
 enabling
  the ping handler to return 200s) to avoid this problem.
 

 What are you doing to programmatically disable/enable the ping handler?
 This sounds like exactly what I should be doing as well...



Re: custom request handler

2012-11-07 Thread Amit Nithian
Why not do this in a ServletFilter? Alternatively, I'd just write a front
end application servlet to do this so that you don't firewall your internal
admins off from accessing the core Solr admin pages. I guess you could
solve this using some form of security but I don't know this well enough.

If I were to restrict access to certain parts of Solr, I'd do this outside
of Solr itself and do this in a servlet or a filter, inspecting the
parameters. It's easy to create a modifiable parameters class and
populate that with acceptable parameters before the Solr filter operates on
it.

HTH
Amit


On Tue, Nov 6, 2012 at 6:46 AM, Lee Carroll lee.a.carr...@googlemail.com wrote:

 Hi we are extending SearchHandler to provide a custom search request
 handler. Basically we've added NamedLists called allowed , whiteList,
 maxMinList etc.

 These look like the default, append and invariant namedLists in the
 standard search handler config. In handleRequestBody we then remove params
 not listed in the allowed named list, white list values as per the white
 list and so on.

 The idea is to have a safe request handler which the big bad world could
 be exposed to. I'm worried. What have we missed that a front end app could
 give us ?

 Also removing params in SolrParams is a bit clunky. We are basically
 converting SolrParams into NamedList processing a new NamedList from this
 and then .setParams(SolrParams.toSolrParams(nlNew)) Is their a better way?
 In particular namedLists are not set up for key look ups...

 Anyway basically is having a custom request handler doing the above the way
 to go ?

 Cheers



Re: Urgent Help Needed: Solr Data import problem

2012-10-30 Thread Amit Nithian
This error is typically because of a mysql permissions problem. These
are usually resolved by a GRANT statement on your DB to allow for
users to connect remotely to your database server.

I don't know the full syntax but a quick search on Google should yield
what you are looking for. If you don't control access to this DB, talk
to your sys admin who does maintain this access and s/he should be
able to help resolve this.

On Tue, Oct 30, 2012 at 7:13 AM, Travis Low t...@4centurion.com wrote:
 Like Amit said, this appears not to be a Solr problem. From the command
 line of your machine, try this:

 mysql -u'readonly' -p'readonly' -h'10.86.29.32' hpcms_db_new

 If that works, and 10.86.29.32 is the server referenced by the URL in your
 data-config.xml problem, then at least you know you have database
 connectivity, and to the right server.

 Also, if your unix server (presumably your mysql server) is 10.86.29.32,
 then the URL in your data-config.xml is pointing to the wrong machine.  If
 the one in the data-config.xml is correct, you need to test for
 connectivity to that machine instead.

 cheers,

 Travis

 On Tue, Oct 30, 2012 at 5:15 AM, kunal sachdeva 
 kunalsachde...@gmail.com wrote:

 Hi,

 This is my data-config file:-

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://172.16.37.160:3306/hpcms_db_new"
                user="readonly" password="readonly"/>
    <document>
      <entity name="package"
              query="select concat('pckg', id) as id,pkg_name,updated_time from hp_package_info;">
      </entity>
      <entity name="destination"
              query="select name,id from hp_city">
        <field column="name" name="dest_name"/>
      </entity>
      <!--
      <entity name="theme"
              query="select id,name from hp_themes">
        <field column="name" name="theme_name"/>
      </entity>
      -->
    </document>
  </dataConfig>


 and password is not null. and 10.86.29.32 is my unix server ip.

 regards,
 kunal

 On Tue, Oct 30, 2012 at 2:42 PM, Dave Stuart d...@axistwelve.com wrote:

  It looks as though you have a password set on your unix server. You will
  need to either remove this or add the password into the connection
  string
 
  e.g. readonly:[yourpassword]@'10.86.29.32'
 
 
 
   'readonly'@'10.86.29.32'
   (using password: NO)
  On 30 Oct 2012, at 09:08, kunal sachdeva wrote:
 
   Hi,
  
   I'm not getting this error while running in local machine. Please Help
  
   Regards,
   Kunal
  
   On Tue, Oct 30, 2012 at 10:32 AM, Amit Nithian anith...@gmail.com
  wrote:
  
   This looks like a MySQL permissions problem and not a Solr problem.
   Caused by: java.sql.SQLException: Access denied for user
   'readonly'@'10.86.29.32'
   (using password: NO)
  
   I'd advise reading your stack traces a bit more carefully. You should
   check your permissions or if you don't own the DB, check with your DBA
   to find out what user you should use to access your DB.
  
   - Amit
  
   On Mon, Oct 29, 2012 at 9:38 PM, kunal sachdeva
   kunalsachde...@gmail.com wrote:
   Hi,
  
   I have tried using data-import in my local system. I was able to
  execute
   it
   properly. but when I tried to do it unix server I got following
 error:-
  
  
   INFO: Starting Full Import
   Oct 30, 2012 9:40:49 AM
   org.apache.solr.handler.dataimport.SimplePropertiesWriter
   readIndexerProperties
   WARNING: Unable to read: dataimport.properties
   Oct 30, 2012 9:40:49 AM org.apache.solr.update.DirectUpdateHandler2
   deleteAll
   INFO: [core0] REMOVING ALL DOCUMENTS FROM INDEX
   Oct 30, 2012 9:40:49 AM org.apache.solr.core.SolrDeletionPolicy
 onInit
   INFO: SolrDeletionPolicy.onInit: commits:num=1
  
  
  
 
 commit{dir=/opt/testsolr/multicore/core0/data/index,segFN=segments_1,version=1351490646879,generation=1,filenames=[segments_1]
   Oct 30, 2012 9:40:49 AM org.apache.solr.core.SolrDeletionPolicy
   updateCommits
   INFO: newest commit = 1351490646879
   Oct 30, 2012 9:40:49 AM
   org.apache.solr.handler.dataimport.JdbcDataSource$1
   call
   INFO: Creating a connection for entity destination with URL:
   jdbc:mysql://
   172.16.37.160:3306/hpcms_db_new
   Oct 30, 2012 9:40:50 AM org.apache.solr.common.SolrException log
   SEVERE: Exception while processing: destination document :
  
  
 
 SolrInputDocument[{}]:org.apache.solr.handler.dataimport.DataImportHandlerException:
   Unable to execute query: select name,id from hp_city Processing
  Document
   # 1
  at
  
  
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
  at
  
  
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
  at
  
  
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
  at
  
  
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
   Caused by: java.lang.RuntimeException:
   org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
  to
   execute query: select name,id from hp_city Processing Document # 1
  at
  
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument

Re: Any way to by pass the checking on QueryElevationComponent

2012-10-29 Thread Amit Nithian
Is the goal to have the elevation data read from somewhere else? In
other words, why don't you want the elevate.xml to exist locally?

If you want to read the data from somewhere else, could you put a
dummy elevate.xml locally and subclass the QueryElevationComponent and
override the loadElevationMap() to read this data from your own custom
location?

On Fri, Oct 26, 2012 at 6:47 PM, James Ji jiayu...@gmail.com wrote:
 Hi there

 We are currently working on having Solr files read from HDFS. We extended
 some of the classes so as to avoid modifying the original Solr code and
 make it compatible with the future release. So here comes the question, I
 found in QueryElevationComponent, there is a piece of code checking whether
 elevate.xml exists at local file system. I am wondering if there is a way
 to by pass this?
 QueryElevationComponent.inform() {
   ...
   File fC = new File(core.getResourceLoader().getConfigDir(), f);
   File fD = new File(core.getDataDir(), f);
   if (fC.exists() == fD.exists()) {
     throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
         "QueryElevationComponent missing config file: '" + f + "'\n" +
         "either: " + fC.getAbsolutePath() + " or " + fD.getAbsolutePath() +
         " must exist, but not both.");
   }
   if (fC.exists()) {
     exists = true;
     log.info("Loading QueryElevation from: " + fC.getAbsolutePath());
     Config cfg = new Config(core.getResourceLoader(), f);
     elevationCache.put(null, loadElevationMap(cfg));
   }
   ...
 }

 --
 Jiayu (James) Ji,

 ***

 Cell: (312)823-7393
 Website: https://sites.google.com/site/jiayuji/

 ***


Re: Urgent Help Needed: Solr Data import problem

2012-10-29 Thread Amit Nithian
This looks like a MySQL permissions problem and not a Solr problem.
Caused by: java.sql.SQLException: Access denied for user
'readonly'@'10.86.29.32'
(using password: NO)

I'd advise reading your stack traces a bit more carefully. You should
check your permissions or if you don't own the DB, check with your DBA
to find out what user you should use to access your DB.
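
For reference, the fix is usually one of these two (the password below is just a
placeholder; host and database names come straight from your trace): grant the
readonly user access from the Solr host on the MySQL side, or make sure DIH
actually sends a password, since the trace says "using password: NO".

  -- run by a MySQL admin; adjust the host mask and password to your environment
  GRANT SELECT ON hpcms_db_new.* TO 'readonly'@'10.86.29.32' IDENTIFIED BY 'secret';
  FLUSH PRIVILEGES;

  <!-- data-config.xml: make sure user/password are set on the dataSource -->
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://172.16.37.160:3306/hpcms_db_new"
              user="readonly" password="secret"/>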

- Amit

On Mon, Oct 29, 2012 at 9:38 PM, kunal sachdeva
kunalsachde...@gmail.com wrote:
 Hi,

 I have tried using data-import in my local system. I was able to execute it
 properly. but when I tried to do it unix server I got following error:-


 INFO: Starting Full Import
 Oct 30, 2012 9:40:49 AM
 org.apache.solr.handler.dataimport.SimplePropertiesWriter
 readIndexerProperties
 WARNING: Unable to read: dataimport.properties
 Oct 30, 2012 9:40:49 AM org.apache.solr.update.DirectUpdateHandler2
 deleteAll
 INFO: [core0] REMOVING ALL DOCUMENTS FROM INDEX
 Oct 30, 2012 9:40:49 AM org.apache.solr.core.SolrDeletionPolicy onInit
 INFO: SolrDeletionPolicy.onInit: commits:num=1

 commit{dir=/opt/testsolr/multicore/core0/data/index,segFN=segments_1,version=1351490646879,generation=1,filenames=[segments_1]
 Oct 30, 2012 9:40:49 AM org.apache.solr.core.SolrDeletionPolicy
 updateCommits
 INFO: newest commit = 1351490646879
 Oct 30, 2012 9:40:49 AM org.apache.solr.handler.dataimport.JdbcDataSource$1
 call
 INFO: Creating a connection for entity destination with URL: jdbc:mysql://
 172.16.37.160:3306/hpcms_db_new
 Oct 30, 2012 9:40:50 AM org.apache.solr.common.SolrException log
 SEVERE: Exception while processing: destination document :
 SolrInputDocument[{}]:org.apache.solr.handler.dataimport.DataImportHandlerException:
 Unable to execute query: select name,id from hp_city Processing Document # 1
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
 at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
 at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
 at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
 Caused by: java.lang.RuntimeException:
 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
 execute query: select name,id from hp_city Processing Document # 1
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
 at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
 ... 3 more
 Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
 Unable to execute query: select name,id from hp_city Processing Document # 1
 at
 org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
 at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
 at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
 ... 5 more
 Caused by: java.sql.SQLException: Access denied for user
 'readonly'@'10.86.29.32'
 (using password: NO)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3491)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3423)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:910)
 at com.mysql.jdbc.MysqlIO.secureAuth411(MysqlIO.java:3923)
 at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1273)
 at
 com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2031)
 at com.mysql.jdbc.ConnectionImpl.init(ConnectionImpl.java:718)
 at com.mysql.jdbc.JDBC4Connection.init(JDBC4Connection.java:46)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at 

Re: Monitor Deleted Event

2012-10-24 Thread Amit Nithian
I'm not 100% sure about this but looks like update processors may help?
http://wiki.apache.org/solr/UpdateRequestProcessor

It looks like you can put in custom code to execute when certain
actions happen so sounds like this is what you are looking for.
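
A rough, untested sketch of that idea (the class name and method bodies are
mine, not from the wiki); you would register the factory in an
updateRequestProcessorChain in solrconfig.xml:

  import java.io.IOException;

  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.DeleteUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  public class DeleteAuditProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processDelete(DeleteUpdateCommand cmd) throws IOException {
          // record {contentid, deletedat} wherever you need it (another Solr
          // core, a database, ...) before letting the delete continue down the chain
          audit(cmd.getId());
          super.processDelete(cmd);
        }
      };
    }

    private void audit(String id) {
      // placeholder: write the id plus a timestamp to your secondary store
    }
  }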

Cheers
Amit

On Wed, Oct 24, 2012 at 8:43 AM, jefferyyuan yuanyun...@gmail.com wrote:
 When some docs are deleted from Solr server, I want to execute some code -
 for example, add an record such as {contentid, deletedat} to another solr
 server or database.

 How can I do this through Solr or Lucene?

 Thanks for any reply and help :)



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Monitor-Deleted-Event-tp4015624.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Monitor Deleted Event

2012-10-24 Thread Amit Nithian
Since Lucene is a library there isn't much support for this; in theory
the client application issuing the delete could simply do something else
upon delete itself. Solr, on the other hand, is a server layer sitting on
top of Lucene, so it makes sense for hooks to be configured there.

Since here you can intercept the delete event, you can do what you
wish with it (i.e. in your case maybe send a notification event to
another solr server to add a record).
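
For completeness, such a processor gets wired in through a chain in
solrconfig.xml; the custom factory class name below is hypothetical:

  <updateRequestProcessorChain name="with-delete-audit" default="true">
    <processor class="com.example.DeleteAuditProcessorFactory"/>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>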

On Wed, Oct 24, 2012 at 9:19 AM, jefferyyuan yuanyun...@gmail.com wrote:
 Thanks very much :)

 This is what I am looking for.
 And I also wonder whether this some thing as DeleteEvent in Solr or Lucene?

 Is there a way to do this in Lucene? - Not familiar with Lucene yet :)
 As I may choose to do this in lower level...



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Monitor-Deleted-Event-tp4015624p4015641.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Understanding Filter Queries

2012-10-20 Thread Amit Nithian
Hi all,

Quick question. I've been reading up on the filter query and how it's
implemented and the multiple articles I see keep referring to this
notion of leap frogging and filter query execution in parallel with
the main query. Question: Can someone point me to the code that does
this so I can better understand?

Thanks!
Amit


Benchmarking/Performance Testing question

2012-10-19 Thread Amit Nithian
Hi all,

I know there have been many posts about this already and I have done
my best to read through them but one lingering question remains. When
doing performance testing on a Solr instance (under normal production
like circumstances, not the ones where commits are happening more
frequently than necessary), is there any value in performance testing
against a server with caches *disabled* with a profiler hooked up to
see where queries in the absence of a cache are spending the most
time?

The reason I am asking this is to tune things like field types (tint vs.
regular int, different precision steps, etc.). Maybe sorting is taking a
long time and the profiler shows an inordinate amount of time spent there,
so we find a different way to solve that particular problem; or perhaps we
are faceting on a bad field. Then we can optimize those pieces to at least
not be as slow, and ensure that caching is tuned properly so that cache
misses don't yield these expensive spikes.

I'm trying to devise a proper performance testing for any new
features/config changes and wanted to get some feedback on whether or
not this approach makes sense. Of course performance testing against a
typical production setup *with* caching will also be done to make sure
things behave as expected.

Thanks!
Amit


Re: Easy question ? docs with empty geodata field

2012-10-19 Thread Amit Nithian
What about querying on the dynamic lat/long field to see if there are
documents that do not have the dynamic _latlon0 or whatever defined?

On Fri, Oct 19, 2012 at 8:17 AM, darul daru...@gmail.com wrote:
 I have already tried but get a nice exception because of this field type :




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Easy-question-docs-with-empty-geodata-field-tp4014751p4014763.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Easy question ? docs with empty geodata field

2012-10-19 Thread Amit Nithian
So here is my spec for lat/long (similar to yours except I explicitly
define the sub-field names for clarity)
<fieldType name="latLon" class="solr.LatLonType" subFieldSuffix="_latLon"/>
<field name="location" type="latLon" indexed="true" stored="true"/>
<!-- Could use dynamic fields here but prefer explicitly defining them
so it's clear what's going on. The LatLonType looks to be a wrapper
around these fields? -->
<field name="location_0_latLon" type="tdouble" indexed="true" stored="true"/>
<field name="location_1_latLon" type="tdouble" indexed="true" stored="true"/>

So then the query would be location_0_latLon:[ * TO *].

Looking at your schema, my guess would be:
location_0_coordinate:[* TO *]
location_1_coordinate:[* TO *]
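
And since the original question was about documents with *empty* geodata, the
negated form of that range query should find them (assuming
location_0_coordinate really is the generated sub-field name):

fq=location_0_coordinate:[* TO *]     (documents that have geodata)
fq=-location_0_coordinate:[* TO *]    (documents with no geodata)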

Let me know if that helps
Amit

On Fri, Oct 19, 2012 at 9:37 AM, darul daru...@gmail.com wrote:
 Your idea looks great but with this schema info :

  <fieldType name="point" class="solr.PointType" dimension="2" subFieldSuffix="_d"/>
  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
  <fieldtype name="geohash" class="solr.GeoHashField"/>
  .

  <field name="geodata" type="location" indexed="true" stored="true"/>
  <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

 How can I use it ?

 fq=location_coordinate:[1 to *] not working by instance





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Easy-question-docs-with-empty-geodata-field-tp4014751p4014779.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: maven artifact for solr-solrj-4.0.0

2012-10-18 Thread Amit Nithian
I am not sure if this repository
https://repository.apache.org/content/repositories/releases/ works but
the modification dates seem reasonable given the timing of the
release. I suspect it'll be on maven central soon (hopefully)
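
Once it lands, the dependency should just be the usual coordinates (a sketch,
assuming 4.0.0 is published under the same ids as the beta):

  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-solrj</artifactId>
    <version>4.0.0</version>
  </dependency>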

On Wed, Oct 17, 2012 at 11:13 PM, Grzegorz Sobczyk
grzegorz.sobc...@contium.pl wrote:
 Hello
 Is there maven artifact for solrj 4.0.0 release ?
 When it will be available to download from http://mvnrepository.com/ ??

 version 4.0.0-BETA isn't compatible with 4.0.0 (problems with zookeeper and
 clusterstate.json parsing)

 Best regards
 Grzegorz Sobczyk



With Grouping enabled, 0 results yields maxScore of -Infinity

2012-10-15 Thread Amit Nithian
I see that when there are 0 results with the grouping enabled, the max
score is -Infinity which causes parsing problems on my client. Without
grouping enabled the max score is 0.0. Is there any particular reason
for this difference? If not, would there be any resistance to
submitting a patch that will set the score to 0 if the numFound is 0
in the grouping component? I see code that sets the max score to
-Infinity and then will set it to a different value when iterating
over some set of scores. With 0 scores, then it stays as -Infinity and
serializes out as such.

I'll be more than happy to work on this patch but before I do, I
wanted to check that I am not missing something first.

Thanks
Amit


Re: Sum of scores for documents from a query.

2012-10-14 Thread Amit Nithian
Are you looking for the sum of the scores of each document in the
result? In other words, if there were 1000 documents in the numFound
but you only of course show 10 (or 0 depending on rows parameter) you
want the sum of all the scores of 1000 documents in a separate section
of the results?

If so, I have some code and a blog post that I am going to write soon
about it. Shoot me a private note and I'll zip and send to you. I have
it as a separate component.

Thanks
Amit

On Sun, Oct 14, 2012 at 4:47 PM, Erick Erickson erickerick...@gmail.com wrote:
 bq:   is there any way to get a sum of all the scores for a query

 not that I know of. I'm not sure what value this would be anyway,
 what do you want to use it for? This seems like an XY problem...

 Best
 Erick

 On Sun, Oct 14, 2012 at 4:39 PM, Gilles Comeau gilles.com...@polecat.co 
 wrote:
 Hi all,

 Very quick question:   Score is created for each query, however is there any 
 way to get a sum of all the scores for a query in the URL?

 I've tried stats and it didn't work, and also had no luck with function 
 queries.  Does anyone know a way to do this?

 Kind Regards,

 Gilles


Re: Auto Correction?

2012-10-09 Thread Amit Nithian
What's preventing you from using the spell checker and take the #1
result and re-issue the query from a sub-class of the query component?
It should be reasonably fast to re-execute the query from the server
side since you are already within Solr. You can modify the response to
indicate that the new query was used so your client can display to the
user that it searched automatically for milky.. click here for
searches for mlky or something.

On Tue, Oct 9, 2012 at 8:46 AM, Ahmet Arslan iori...@yahoo.com wrote:
 I would like to ask if there are any ways to correct user's
 queries
 automatically? I know there is spellchecker which *suggests*
 possible
 correct words... The thing i wanna do is *automatically
 fixing* those
 queries and running instead of the original one

 not out of the box, you need to re-run suggestions at client side. There is 
 an commercial product though.
 http://sematext.com/products/dym-researcher/index.html


PostFilters, Grouping, Sorting Oh My!

2012-10-09 Thread Amit Nithian
Hi all,

I've been working with using Solr's post filters/delegate collectors
to collect some statistics about the scores of all the documents and
had a few questions with regards to this when combined with grouping
and sorting:
1) I noticed that if I don't include the score field as part of the
sort spec with *no* grouping enabled, my custom delegate scorer gets
called so I can then collect the stats I need. Same is true with score
as part of the sort spec (this then leads me to focus on the grouping
feature)
2) If I turn ON grouping:
  a) WITH score in the sort spec, my custom delegate scorer gets called
  b) WITHOUT score in the sort spec, my custom delegate scorer does
NOT get called.

What's interesting though is that there *are* scores generated so I'm
not sure what all is going on. I traced through the code and saw that
the scorer gets called as part of one of the comparators
(RelevanceComparator) which is why with score in the sort spec it
works but that is about as far as I could go. Since I am not too
worried in my application about a sort spec without the score always
being there it's not a huge concern; however, I do want to understand
why with the grouping feature enabled, this doesn't work and whether
or not it's a bug.

Any help on this would be appreciated so that my solution to this
problem is complete.

Thanks!
Amit


Solr 4.0 and Maven SNAPSHOT artifacts

2012-10-04 Thread Amit Nithian
Is there a maven repository location that contains the nightly build
Maven artifacts of Solr? Are SNAPSHOT releases being generated by
Jenkins or anything so that when I re-resolve the dependencies I'd get
the latest snapshot jars?

Thanks
Amit


Re: Getting list of operators and terms for a query

2012-10-04 Thread Amit Nithian
I think you'd want to start by looking at the rb.getQuery() in the
prepare (or process if you are trying to do post-results analysis).
This returns a Query object that would contain everything in that and
I'd then look at the Javadoc to see how to traverse it. I'm sure some
runtime type-casting may be necessary to get at the sub-structures

On Thu, Oct 4, 2012 at 9:23 AM, Davide Lorenzo Marino
davide.mar...@gmail.com wrote:
 I don't need really start from the query String.
 What I need is obtain a list of terms and operators.
 So the real problem is:

 How can I access the Lucene Query structure to traverse it?

 Davide Marino


 2012/10/4 Jack Krupansky j...@basetechnology.com

 I'm not quite following what the issue is here. I mean, the Solr
 QueryComponent generates a Lucene Query structure and you need to write
 code to recursively traverse that Lucene Query structure and generate your
 preferred form of output. There would be no need to look at the original
 query string. So, what exactly are you asking?

 Maybe you simply need to read up on Lucene Query and its subclasses to
 understand what that structure looks like.

 -- Jack Krupansky

 -Original Message- From: Davide Lorenzo Marino
 Sent: Thursday, October 04, 2012 11:36 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Getting list of operators and terms for a query


 It's ok.. I did it and I took the query string.
 The problem is convert the java.lang.string (query) in a list of term and
 operators and doing it using the same parser used by Solr to execute the
 queries.

 2012/10/4 Mikhail Khludnev mkhlud...@griddynamics.com

  you've got ResponseBuilder as process() or prepare() argument, check
 query field, but your component should be registered after
 QueryComponent
 in your requestHandler config.

 On Thu, Oct 4, 2012 at 6:03 PM, Davide Lorenzo Marino 
 davide.mar...@gmail.com wrote:

  Hi All,
  i'm working in a new searchComponent that analyze the search queries.
  I need to know if given a query string is possible to get the list of
  operators and terms (better in polish notation)?
  I mean if the default field is country and the query is the String
 
  england OR (name:paul AND city:rome)
 
  to get the List
 
  [ Operator OR, Term country:england, OPERATOR AND, Term name:paul, Term
  city:rome ]
 
  Thanks in advance
 
  Davide Marino
 



 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com





Re: Getting list of operators and terms for a query

2012-10-04 Thread Amit Nithian
I'm not 100% sure but my guess is that you can get the list of boolean
clauses and their occur (must, should, must not) and that would be
your and, or, not equivalents.
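
Something along these lines (a rough, untested sketch) would walk the parsed
Query and print terms together with their Occur flags; MUST/SHOULD/MUST_NOT are
effectively your AND/OR/NOT:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;

  public class QueryWalker {
    public void walk(Query q) {
      if (q instanceof BooleanQuery) {
        for (BooleanClause clause : ((BooleanQuery) q).clauses()) {
          System.out.println("occur=" + clause.getOccur());
          walk(clause.getQuery());   // sub-queries may themselves be BooleanQueries
        }
      } else if (q instanceof TermQuery) {
        Term t = ((TermQuery) q).getTerm();
        System.out.println("term=" + t.field() + ":" + t.text());
      } else {
        // phrase, range, wildcard, etc. would need their own branches
        System.out.println("other query type: " + q.getClass().getSimpleName());
      }
    }
  }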



On Thu, Oct 4, 2012 at 10:39 AM, Davide Lorenzo Marino
davide.mar...@gmail.com wrote:
 For what I saw in the documentation from the class
 org.apache.lucene.search.Query
 I can just iterate over the terms using the method extractTerms. How can I
 extract the operators?

 2012/10/4 Amit Nithian anith...@gmail.com

 I think you'd want to start by looking at the rb.getQuery() in the
 prepare (or process if you are trying to do post-results analysis).
 This returns a Query object that would contain everything in that and
 I'd then look at the Javadoc to see how to traverse it. I'm sure some
 runtime type-casting may be necessary to get at the sub-structures

 On Thu, Oct 4, 2012 at 9:23 AM, Davide Lorenzo Marino
 davide.mar...@gmail.com wrote:
  I don't need really start from the query String.
  What I need is obtain a list of terms and operators.
  So the real problem is:
 
  How can I access the Lucene Query structure to traverse it?
 
  Davide Marino
 
 
  2012/10/4 Jack Krupansky j...@basetechnology.com
 
  I'm not quite following what the issue is here. I mean, the Solr
  QueryComponent generates a Lucene Query structure and you need to write
  code to recursively traverse that Lucene Query structure and generate
 your
  preferred form of output. There would be no need to look at the original
  query string. So, what exactly are you asking?
 
  Maybe you simply need to read up on Lucene Query and its subclasses to
  understand what that structure looks like.
 
  -- Jack Krupansky
 
  -Original Message- From: Davide Lorenzo Marino
  Sent: Thursday, October 04, 2012 11:36 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Getting list of operators and terms for a query
 
 
  It's ok.. I did it and I took the query string.
  The problem is convert the java.lang.string (query) in a list of term
 and
  operators and doing it using the same parser used by Solr to execute the
  queries.
 
  2012/10/4 Mikhail Khludnev mkhlud...@griddynamics.com
 
   you've got ResponseBuilder as process() or prepare() argument, check
  query field, but your component should be registered after
  QueryComponent
  in your requestHandler config.
 
  On Thu, Oct 4, 2012 at 6:03 PM, Davide Lorenzo Marino 
  davide.mar...@gmail.com wrote:
 
   Hi All,
   i'm working in a new searchComponent that analyze the search queries.
   I need to know if given a query string is possible to get the list of
   operators and terms (better in polish notation)?
   I mean if the default field is country and the query is the String
  
   england OR (name:paul AND city:rome)
  
   to get the List
  
   [ Operator OR, Term country:england, OPERATOR AND, Term name:paul,
 Term
   city:rome ]
  
   Thanks in advance
  
   Davide Marino
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 
 
 



Re: Query filtering

2012-09-27 Thread Amit Nithian
I think one way to do this is issue another query and set a bunch of
filter queries to restrict interesting_facet to just those ten
values returned in the first query.

fq=interesting_facet:1 OR interesting_facet:2 etc.&q=context:whatever

Does that help?
Amit

On Thu, Sep 27, 2012 at 6:33 AM, Finotti Simone tech...@yoox.com wrote:
 Hello,
 I'm doing this query to return top 10 facets within a given context, 
 specified via the fq parameter.

 http://solr/core/select?fq=(...)&q=*:*&rows=0&facet.field=interesting_facet&facet.limit=10

 Now, I should search for a term inside the context AND the previously 
 identified top 10 facet values.

 Is there a way to do this with a single query?

 thank you in advance,
 S


Re: Getting the distribution information of scores from query

2012-09-27 Thread Amit Nithian
Thanks! That did the trick! It did require some more work at the component
level to generate the same query key as the index searcher does; otherwise,
when you go to fetch scores for a cached query result, you get a lot of
NPEs because the stats are computed at the collector level, which never
runs when a cache hit bypasses the Lucene level. I'll write up what I did
and probably try to open source the work for others to see. The stuff with
PostFiltering is nice but needs some examples and documentation.. hopefully
mine will help the cause.
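
The core of what I ended up with looks roughly like this (simplified and
untested here; the class name is mine and the query-key handling mentioned
above is left out):

  import java.io.IOException;

  import org.apache.lucene.search.IndexSearcher;
  import org.apache.solr.search.DelegatingCollector;
  import org.apache.solr.search.ExtendedQueryBase;
  import org.apache.solr.search.PostFilter;

  public class ScoreStatsQuery extends ExtendedQueryBase implements PostFilter {

    public long count;
    public double sum, sumOfSquares;   // enough to derive mean and std-dev later

    public ScoreStatsQuery() {
      setCache(false);   // post filters must not be cached
      setCost(100);      // cost >= 100 marks this as a post filter
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
      return new DelegatingCollector() {
        @Override
        public void collect(int doc) throws IOException {
          float score = scorer.score();   // scorer is set on the DelegatingCollector
          count++;
          sum += score;
          sumOfSquares += score * score;
          super.collect(doc);             // pass the document through unchanged
        }
      };
    }
  }

A custom component's prepare() adds an instance of this to the request's
filters, and its process() reads count/sum/sumOfSquares back off it to compute
the mean and standard deviation for the response.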

Thanks again
Amit

On Wed, Sep 26, 2012 at 5:13 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 I suggest to create a component, put it after QueryComponent. in prepare it
 should add own PostFilter into list of request filters, your post filter
 will be able to inject own DelegatingCollector, then you can just add
 collected histogram into result named list
  http://searchhub.org/dev/2012/02/10/advanced-filter-caching-in-solr/

 On Tue, Sep 25, 2012 at 10:03 PM, Amit Nithian anith...@gmail.com wrote:

 We have a federated search product that issues multiple parallel
 queries to solr cores and fetches the results and blends them. The
 approach we were investigating was taking the scores, normalizing them
 based on some distribution (normal distribution seems reasonable) and
 use that z score as the way to blend the results (else you'll be
 blending scores on different scales). To accomplish this, I was
 looking to get the distribution of the scores for the query as an
 analog to the stats component but seem to see the only way to
 accomplish this would be to create a custom collector that would
 accumulate and store this information (mean, std-dev etc) since the
 stats component only operates on indexed fields.

 Is there an easy way to tell Solr to use a custom collector without
 having to modify the SolrIndexSearcher class? Maybe is there an
 alternative way to get this information?

 Thanks
 Amit




 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com


Re: AutoIndexing

2012-09-25 Thread Amit Nithian
There's a couple ways to accomplish this from easy to hard depending
on your database schema:
1) Use DB trigger
  - I don't like triggers too much b/c to me they couple your
database layer with your application layer which leads to untestable
and sometimes unmaintainable code
  - Also it gets difficult when you want to re-index a document based
on a change to an auxiliary table. Say you associate an image with the
main entity, you're not touching the main entity table so you then can
have triggers on a bunch of tables which could get messy?
2) Use a database table as a queue of records to index and write to
it from your application when the main entity changes
   - This isn't too bad.. it's a replayable queue basically that you
can purge when you want, query to find all the main entities that
changed and construct your SQL queries accordingly to submit documents
for indexing
3) Use a real message queue and some receiver that will index the document
   - This could be the best but also most complicated solution.. when
your application changes an entity, a message is sent on the queue
either with the actual document itself or maybe an ID where you can
re-construct the document for indexing.

There are probably other solutions too but those are the 3 that come
to mind off hand. Where I work, we use #2, with incremental index
processes that check for changes since some last known time and index
them.
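
As a rough illustration of option #2 (table and column names are made up), the
queue can be as simple as:

  -- the application inserts a row here whenever an entity (or anything that
  -- affects its document) changes
  CREATE TABLE index_queue (
    id         BIGINT AUTO_INCREMENT PRIMARY KEY,
    entity_id  BIGINT NOT NULL,
    changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
  );

  -- the incremental indexer picks up everything touched since its last run,
  -- rebuilds those documents, and purges rows it has processed
  SELECT DISTINCT entity_id FROM index_queue WHERE changed_at > :last_index_time;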

- Amit

On Tue, Sep 25, 2012 at 3:37 AM, Tom Mortimer tom.m.f...@gmail.com wrote:
 I'm afraid I don't have any DIH experience myself, but some googling suggests 
 that using a postgresql trigger to start a delta import might be one approach:

 http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command  and
 http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

 Tom

 On 25 Sep 2012, at 11:28, darshan dk...@dreamsoftech.com wrote:

 My Document is Database(yes RDBMS) and software for it is postgresql, where
 any change in it's table should be reflected, without re-indexing. I am
 indexing it via DIH process
 Thanks,
 Darshan

 -Original Message-
 From: Tom Mortimer [mailto:tom.m.f...@gmail.com]
 Sent: Tuesday, September 25, 2012 3:31 PM
 To: solr-user@lucene.apache.org
 Subject: Re: AutoIndexing

 Hi Darshan,

 Can you give us some more details, e.g. what do you mean by database? A
 RDBMS? Which software? How are you indexing it (or intending to index it) to
 Solr? etc...

 cheers,
 Tom


 On 25 Sep 2012, at 09:55, darshan dk...@dreamsoftech.com wrote:

 Hi All,

   Is there any way where I can auto-index whenever there
 is changes in my database.

 Thanks,

 Darshan






Prevent Log and other math functions from returning Infinity and erroring out

2012-09-20 Thread Amit Nithian
Is there any reason why the log function shouldn't be modified to
always take 1+the number being requested to be log'ed? Reason I ask is
I am taking the log of the value output by another function which
could return 0. For testing, I modified it to return 1 which works but
would rather have the log function simply add 1.

Of course I could do something like log(sum(...)) but that seems a bit
much OR just create my own modified log function in my code but was
wondering if there would be any objections to filing an issue and
patch to fix math functions like this from returning infinity?

Thanks
Amit


Re: Is it possible to do an if statement in a Solr query?

2012-09-12 Thread Amit Nithian
If the fact that it's original vs generic is a field is_original
0/1 can you sort by is_original? Similarly, could you put a huge boost
on is_original in the dismax so that document matches on is_original
score higher than those that aren't original? Or is your goal to not
show generics *at all*?
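
Concretely (assuming is_original really is an indexed 0/1 field), either of
these should push originals above generics:

  sort=is_original desc,score desc

or, with (e)dismax, a big boost instead of a hard sort:

  q=...&defType=edismax&bq=is_original:1^100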


On Wed, Sep 12, 2012 at 2:47 PM, Walter Underwood wun...@wunderwood.org wrote:
 You may be able to do this with grouping. Group on the medicine family, and 
 only show the Original if there are multiple items in the family.

 wunder

 On Sep 12, 2012, at 2:09 PM, Gustav wrote:

 Hello everyone, I'm working on an e-commerce website and using Solr as my
 Search Engine, im really enjoying its funcionality and the search
 options/performance.
 But i am stucky in a kinda tricky cenario... That what happens:

 I Have  a medicine web-store, where i indexed all necessary products in my
 Index Solr.
 But when i search for some medicine, following my business rules, i have to
 verify if the result of my search contains any Original medicine, if there
 is any, then i wouldn't show the generics of this respective medicine, on
 the other hand, if there wasnt any original product in the result i would
 have to return its generics.
 Im currently returning the original and generics, is there a way to do this
 kind of checking in solr?

 Thanks! :)







Re: XInclude Multiple Elements

2012-09-11 Thread Amit Nithian
Way back when I opened an issue about using XML entity includes in
Solr as a way to break up the config. I have found problems with
XInclude having multiple elements to include because the file is not
well formed. From what I have read, if you make this well formed, you
end up with a document that's not what you expect.

For example:
my schema.xml has
<fields>
  ...
  <xinclude href="more_fields.xml" .../>
</fields>

more_fields.xml:
<field name="..."/>

which isn't well formed. You could make it well formed:
<fields>
  <field name="..."/>
</fields>
but then I think you end up with nested fields element which doesn't
work (and btw I still keep getting the blasted failed to parse error
which isn't very helpful). Looking at this made me wonder if entity
includes work with Solr 4 and indeed they do! They aren't as flexible
as XIncludes but for the purpose of breaking up an XML file into
smaller pieces, it works beautifully and as you would expect.

You can simply declare your entities at the top as shown in the
earlier thread and then include them where you need. I've been using
this for years and it works fairly well.
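
For example, a schema.xml skeleton using an entity include might look like this
(names are made up; the quoted message below shows the same trick for
solrconfig.xml):

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE schema [
    <!ENTITY morefields SYSTEM "more_fields.xml">
  ]>
  <schema name="example" version="1.5">
    <fields>
      <field name="id" type="string" indexed="true" stored="true" required="true"/>
      <!-- more_fields.xml contains only <field .../> elements, no root tag -->
      &morefields;
    </fields>
    <!-- types, uniqueKey, etc. as usual -->
  </schema>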

Cheers!
Amit


On Thu, May 31, 2012 at 7:01 AM, Bogdan Nicolau bogdan@gmail.com wrote:
 I've also tried a lot of tricks to get xpointer working with multiple child
 elements, to no success.
 In the end, I've resorted to a less pretty, other-way-around solution. I do
 something like this:
 solrconfig_common.xml - no xml declaration, no root tag, no nothing:
 <etc>...</etc>
 <etc2>...</etc2>
 ...
 For each file that I need the common stuff into, I'd do something like this:
 solrconfig_master.xml/solrconfig_slave.xml/etc.
 <?xml version="1.0" encoding="UTF-8" ?>
 <!DOCTYPE config [
   <!ENTITY solrconfigcommon SYSTEM "solrconfig_common.xml">
 ]>

 <config>
   &solrconfigcommon;

 </config>

 Solr starts with 0 warnings, the configuration is properly loaded, etc.
 Property substitution also works, including inside the
 solrconfig_common.xml. Hope it helps anyone.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/XInclude-Multiple-Elements-tp3167658p3987029.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replication policy

2012-09-11 Thread Amit Nithian
If I understand you right,  replication of data has 0 downtime, it
just works and the data flows through from master to slaves. If you
want, you can configure the replication to replicate configuration
files across the cluster (although to me my deploy script does this).
I'd recommend tweaking the warmers so that you don't get latency
spikes due to cold caches during the replications.

Not being well versed in the latest Solr features (I'm a bit behind
here), I don't know if you can reload the cores on demand to indicate
the latest configurations or not but in my environment, I have a
rolling restart script that bounces a set of servers when the
schema/solrconfig changes.

HTH
Amit

On Mon, Sep 10, 2012 at 11:10 PM, Abhishek tiwari
abhishek.tiwari@gmail.com wrote:
 HI All,

  am having 1 master and 3 slave solr server.(verson 3.6)
  What kind of replication policy should i adopt with zero down time  no
 data loss .

 1) when we do some configuration and schema  changes on the solr server .


Re: solr.StrField with stored=true useless or bad?

2012-09-11 Thread Amit Nithian
This is great thanks for this post! I was curious about the same thing
and was wondering why fl couldn't return the indexed
representation of a field if that field were only indexed but not
stored. My thoughts were return something than nothing but I didn't
pay attention to the fact that getting even the indexed
representation of a field given a document is not fast.

Thanks
Amit

On Tue, Sep 11, 2012 at 4:03 PM,  sy...@web.de wrote:
 Hi,

 I have a StrField to store an URL. The field definition looks like this:
 <field name="link" type="string" indexed="true" stored="true" required="true" />

 Type string is defined as usual:
 <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

 Then I realized that a StrField doesn't execute any analyzers and stores data 
 verbatim. The data is just a single token.

 The purpose of stored=true is to store the raw string data besides the 
 analyzed/transformed data for displaying purposes. This is fine for an 
 analyzed solr.TextField, but for an StrField both values are the same. So is 
 there any reason to apply stored=true on a StrField as well?

 I ask, because I found a lot of sites and tutorials applying stored=true on 
 StrFields as well. Do they all do it wrong or am I missing something here?


Re: Solr - Lucene Debuging help

2012-09-11 Thread Amit Nithian
The wiki should probably be updated.. maybe I'll take a stab at it.
I'll also try and update my article referenced there too.

When you checkout the project from SVN, do ant eclipse

Look at this bug (https://issues.apache.org/jira/browse/SOLR-3817) and
either run the ruby program or download the patch and apply but either
way it should fix the classpath issues.

Then import the project and you can follow the remainder of the steps
in the 
http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse
article.

Cheers
Amit

On Mon, Sep 10, 2012 at 1:29 PM, BadalChhatbar badal...@yahoo.com wrote:
 Hi Steve,

 Thanks, I was able to create new project using that url. :)

 one more thing,.. its giving me about 32K error. (something like.. this type
 cannot be resolved).

 i tried rebuilding project and running ant command (build.xml) . but it
 didn't help. any suggestions on this ?


 thanks



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Lucene-Debuging-help-tp4006715p4006721.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: In-memory indexing

2012-09-11 Thread Amit Nithian
I have wondered about this too but instead why not just set your cache
sizes large enough to house most/all of your documents and pre-warm
the caches accordingly? My bet is that a large enough document cache
may suffice but that's just a guess.
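
Something like this in solrconfig.xml is what I have in mind; the sizes and the
warming query are arbitrary examples, tune them to your index:

  <documentCache class="solr.LRUCache" size="100000" initialSize="10000"/>

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- pull a chunk of documents through the cache on each new searcher -->
      <lst><str name="q">*:*</str><str name="rows">1000</str></lst>
    </arr>
  </listener>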

- Amit

On Mon, Sep 10, 2012 at 10:56 AM, Kiran Jayakumar
kiranjuni...@gmail.com wrote:
 Hi,

 Does anyone have any experience in hosting the entire index in a RAM disk ?
 (I'm not thinking about Lucene's RAM directory). I have some small indexes
 (less than a Gb). Also, please recommend a good RAM disk application for
 Windows (I have used Gizmo, wondering if there's any better one out there).

 Thanks
 Kiran


Re: Trouble Setting Up Development Environment

2012-09-10 Thread Amit Nithian
Sorry i'm really late to this so not sure if this is even an issue:
1) I found that there is an ant eclipse that makes it easy to setup
the eclipse .project and .classpath (I think I had done this by hand
in the tutorial)
2) Yes you can attach to a remote instance of Solr but your JVM has to
have the remote debug options and port setup. Eclipse can connect
fairly easily to this in the debug configuration menu.
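
For reference, the remote debug options I mean look roughly like this (port
8000 is just an example); Eclipse then attaches through a "Remote Java
Application" debug configuration pointed at that port:

  java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 -jar start.jar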

Thanks
Amit

On Mon, Mar 26, 2012 at 4:13 AM, Erick Erickson erickerick...@gmail.com wrote:
 Depending upon what you actually need to do, you could consider just
 attaching to the running Solr instance remotely. I know it's easy in
 IntelliJ, and believe Eclipse makes this easy as well but I haven't
 used Eclipse in a while

 Best
 Erick

 On Sat, Mar 24, 2012 at 11:11 PM, Li Li fancye...@gmail.com wrote:
 I forgot to write that I am running it in tomcat 6, not jetty.
 you can right click the project - Debug As - Debug on Server - Manually
 define a new Server - Apache - Tomcat 6
 if you should have configured a tomcat.

 On Sun, Mar 25, 2012 at 4:17 AM, Karthick Duraisamy Soundararaj 
 karthick.soundara...@gmail.com wrote:

 I followed your instructions. I got 8 Errors and a bunch of warnings few
 of them related to classpath. I also got the following exception when I
 tried to run with the jetty ( i have attached the full console output with
 this email. I figured solr directory with config files might be missing and
 added that in WebContent.

 Could be of great help if someone can point me at right direction.

 ls WebContent
 admin  favicon.ico  index.jsp  solr  WEB-INF


 *SEVERE: Error in solrconfig.xml:org.apache.solr.common.SolrException: No
 system property or default value specified for solr.test.sys.prop1*
 at
 org.apache.solr.common.util.DOMUtil.substituteProperty(DOMUtil.java:331)
 at
 org.apache.solr.common.util.DOMUtil.substituteProperties(DOMUtil.java:290)
 at
 org.apache.solr.common.util.DOMUtil.substituteProperties(DOMUtil.java:292)
 at org.apache.solr.core.Config.init(Config.java:165)
 at org.apache.solr.core.SolrConfig.init(SolrConfig.java:131)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:435)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
 at
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:133)
 at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
 at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
 at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at
 org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
 at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
 at
 org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
 at
 org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
 at
 org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
 at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at
 org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
 at org.mortbay.jetty.Server.doStart(Server.java:224)
 at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at runjettyrun.Bootstrap.main(Bootstrap.java:97)


 *Here are the 8 errors I got*
  Description | Resource | Path | Location | Type
  core cannot be resolved | dataimport.jsp | /solr3_5/ssrc/solr/contrib/dataimporthandler/src/webapp/admin | line 27 | JSP Problem
  End tag (/html) not closed properly, expected .package.html | /solr3_5/ssrc/lucene/contrib/queryparser/src/java/org/apache/lucene/queryParser/core/config | line 64 | HTML Problem
  Fragment _info.jsp was not found at expected path /solr3_5/ssrc/solr/contrib/dataimporthandler/src/webapp/admin/_info.jsp | dataimport.jsp | /solr3_5/ssrc/solr/contrib/dataimporthandler/src/webapp/admin | line 21 | JSP Problem
  Fragment _info.jsp was not found at expected path /solr3_5/ssrc/solr/contrib/dataimporthandler/src/webapp/admin/_info.jsp | debug.jsp | /solr3_5/ssrc/solr/contrib/dataimporthandler/src/webapp/admin | line 19 | JSP Problem
  Named template dotdots is not available | tabutils.xsl | /solr3_5/ssrc/lucene/src/site/src/documentation/skins/common/xslt/html | line 41 | XSL Problem
  Named template dotdots is not available | tabutils.xsl | /solr3_5/ssrc/solr/site-src/src/documentation/skins/common/xslt/html | line 41 | XSL Problem
  Unhandled exception type Throwable | ping.jsp
 

Re: N-gram ranking based on term position

2012-09-07 Thread Amit Nithian
I think your thought about using the edge ngram as a field and
boosting that field in the qf/pf sections of the dismax handler sounds
reasonable. Why do you have qualms about it?
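
For what it's worth, a sketch of that approach (field and type names are made
up) could look like:

  <fieldType name="text_prefix" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="title_prefix" type="text_prefix" indexed="true" stored="false"/>
  <copyField source="title" dest="title_prefix"/>

and then boost it at query time, e.g. defType=edismax&qf=title^1.0 title_prefix^5.0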

On Fri, Sep 7, 2012 at 12:28 PM, Kiran Jayakumar kiranjuni...@gmail.com wrote:
 Hi,

 Is it possible to score documents with a match early in the text higher
 than later in the text ? I want to boost begin with matches higher than
 the contains matches. I can define a copy field and analyze it as edge
 n-gram and boost it. I was wondering if there was a better way to do it.

 Thanks


Re: Running out of memory

2012-08-16 Thread Amit Nithian
I am debugging an out of memory error myself and a few suggestions:
1) Are you looking at your search logs around the time of the memory
error? In my case, I found a few bad queries requesting a ton of rows
(basically the whole index's worth which I think is an error somewhere
in our app just have to find it) which happened close to the OOM error
being thrown.
2) Do you have Solr hooked up to something like NewRelic/AppDynamics
to see the cache usage in real time? Maybe as was suggested, tuning
down or eliminating low used caches could help.
3) Are you ensuring that you aren't setting stored=true on fields
that don't need it? This will increase the index size and possibly the
cache size if lazy loading isn't enabled (to be honest, this part I am
a bit unclear of since I haven't had much experience with this
myself).

Thanks
Amit

On Mon, Aug 13, 2012 at 11:37 AM, Jon Drukman jdruk...@gmail.com wrote:
 On Sun, Aug 12, 2012 at 12:31 PM, Alexey Serba ase...@gmail.com wrote:

  It would be vastly preferable if Solr could just exit when it gets a
 memory
  error, because we have it running under daemontools, and that would cause
  an automatic restart.
 -XX:OnOutOfMemoryError="<cmd args>; <cmd args>"
 Run user-defined commands when an OutOfMemoryError is first thrown.

  Does Solr require the entire index to fit in memory at all times?
 No.

 But it's hard to say about your particular problem without additional
 information. How often do you commit? Do you use faceting? Do you sort
 by Solr fields and if yes what are those fields? And you should also
 check caches.


 I upgraded to solr-3.6.1 and an extra large amazon instance (15GB RAM) so
 we'll see if that helps.  So far no out of memory errors.


Re: Nrt and caching

2012-07-07 Thread Amit Nithian
Thanks for the responses. I guess my specific question is: if I had
something that depended on the mapping between Lucene document ids and some
object primary key (so I could pull in external data from another data
source without a constant reindex), how would this be affected by soft and
hard commits? I'd prefer not to have to rebuild this mapping from scratch
on each soft or even hard commit if possible, since those seem to happen
frequently.

Also can you explain why and how per segment caches are used and how at the
client of lucene layer one gets access or knows about this? I always
thought segments were an implementation detail where they get merged on
optimize etc so wouldn't that affect clients depending on segment level
stuff? Or what am I missing?

Thanks again!
Amit
On Jul 7, 2012 9:22 AM, Andy angelf...@yahoo.com wrote:

 So If I want to use multi-value facet with NRT I'd need to convert the
 cache to per-segment? How do I do that?

 Thanks.


 
  From: Jason Rutherglen jason.rutherg...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Saturday, July 7, 2012 11:32 AM
 Subject: Re: Nrt and caching

 The field caches are per-segment, which are used for sorting and basic
 [slower] facets.  The result set, document, filter, and multi-value facet
 caches are [in Solr] per-multi-segment.

 Of these, the document, filter, and multi-value facet caches could be
 converted to be [performant] per-segment, as with some other Apache
 licensed Lucene based search engines.

 On Sat, Jul 7, 2012 at 10:42 AM, Yonik Seeley yo...@lucidimagination.com
 wrote:

  On Sat, Jul 7, 2012 at 9:59 AM, Jason Rutherglen
  jason.rutherg...@gmail.com wrote:
   Currently the caches are stored per-multiple-segments, meaning after
 each
   'soft' commit, the cache(s) will be purged.
 
  Depends which caches.  Some caches are per-segment, and some caches
  are top level.
  It's also a trade-off... for some things, per-segment data structures
  would indeed turn around quicker on a reopen, but every query would be
  slower for it.
 
  -Yonik
  http://lucidimagination.com
 


Nrt and caching

2012-07-06 Thread Amit Nithian
Sorry I'm a bit new to the nrt stuff in solr but I'm trying to understand
the implications of frequent commits and cache rebuilding and auto warming.
What are the best practices surrounding nrt searching and caches and query
performance.

Thanks!
Amit


Re: How to improve this solr query?

2012-07-04 Thread Amit Nithian
Couple questions:
1) Why are you explicitly telling solr to sort by score desc,
shouldn't it do that for you? Could this be a source of performance
problems since sorting requires the loading of the field caches?
2) Of the query parameters, q1 and q2, which one is actually doing
text searching on your index? It looks like q1 is doing non-string
related stuff, could this be better handled in either the bf or bq
section of the edismax config? Looking at the sample though I don't
understand how q1=apartment would hit non-string fields again (but see
#3)
3) Are the string fields literally of string type (i.e. no analysis
on the field) or are you saying string loosely to mean text field.
pf == phrase fields == given a multiple word query, will ensure that
the specified phrase exists in the specified fields separated by some
slop (hello my world may match hello world depending on this slop
value). The qf means that given a multi term query, each term exists
in the specified fields (name, description whatever text fields you
want).

Best
Amit

On Mon, Jul 2, 2012 at 9:35 AM, Chamnap Chhorn chamnapchh...@gmail.com wrote:
 Hi all,

 I'm using solr 3.5 with nested query on the 4 core cpu server + 17 Gb. The
 problem is that my query is so slow; the average response time is 12 secs
 against 13 millions documents.

 What I am doing is to send quoted string (q2) to string fields and
 non-quoted string (q1) to other fields and combine the result together.

 facet=true&sort=score+desc&q2=*apartment*&facet.mincount=1&q1=*apartment*&
 tie=0.1&q.alt=*:*&wt=json&version=2.2&rows=20&fl=uuid&facet.query=has_map:+true&facet.query=has_image:+true&facet.query=has_website:+true&start=0&q=
 _query_:+{!dismax+qf='.'+fq='..'+v=$q1}+OR+_query_:+{!dismax+qf='..'+fq='...'+v=$q2}
 &facet.field={!ex%3Ddt}sub_category_uuids&facet.field={!ex%3Ddt}location_uuid

 I have done solr optimize already, but it's still slow. Any idea how to
 improve the speed? Have I done anything wrong?

 --
 Chhorn Chamnap
 http://chamnap.github.com/


Use of Solr as primary store for search engine

2012-07-04 Thread Amit Nithian
Hello all,

I am curious to know how people are using Solr in conjunction with
other data stores when building search engines to power web sites (say
an ecommerce site). The question I have for the group is given an
architecture where the primary (transactional) data store is MySQL
(Oracle, PostGres whatever) with periodic indexing into Solr, when
your front end issues a search query to Solr and returns results, are
there any joins with your primary Oracle/MySQL etc to help render
results?

Basically I guess my question is whether or not you store enough in
Solr so that when your front end renders the results page, it never
has to hit the database. The other option is that your search engine
only returns primary keys that your front end then uses to hit the DB
to fetch data to display to your end user.

With Solr 4.0 and Solr moving towards the NoSQL direction, I am
curious what people are doing and what application architectures with
Solr look like.

Thanks!
Amit


Re: Something like 'bf' or 'bq' with MoreLikeThis

2012-07-04 Thread Amit Nithian
No worries! What version of Solr are you using? One that you
downloaded as a tarball or one that you checked out from SVN (trunk)?
I'll take a bit of time and document steps and respond.

I'll review the patch to see that it fits a general case. Question for
you with MLT: are your users doing a blank search (no text) for
something, or are you returning results more like results that were
generated from a user typing some text query? I may have
built this patch assuming a blank query but I can make it work (or try
to make it work) for text based queries.

Thanks
Amit

On Wed, Jul 4, 2012 at 1:37 AM, nanshi nanshi.e...@gmail.com wrote:
 Thanks a lot, Amit! Please bear with me, I am a new Solr dev, could you
 please shed me some light on how to use a patch? point me to a wiki/doc is
 fine too. Thanks a lot! :)

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Something-like-bf-or-bq-with-MoreLikeThis-tp3989060p3992935.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Use of Solr as primary store for search engine

2012-07-04 Thread Amit Nithian
Paul,

Thanks for your response! Were you using the SQL database as an object
store to pull XWiki objects or did you have to execute several queries
to reconstruct these objects? I don't know much about them sorry..
Also for those responding, can you provide a few basic metrics for me?
1) Number of nodes receiving queries
2) Approximate queries per second
3) Approximate latency per query

I know some of this may be sensitive depending on where you work so
reasonable ranges would be nice (i.e. sub-second isn't hugely helpful
since 50,100,200 ms have huge impacts depending on your site).

Thanks again!
Amit

On Wed, Jul 4, 2012 at 1:09 AM, Paul Libbrecht p...@hoplahup.net wrote:
 Amit,

 not exactly a response to your question, but doing this with a lucene index on 
 i2geo.net has resulted in a considerable performance boost (reading from 
 stored fields instead of reading from the xwiki objects, which pull from the 
 SQL database). However, it meant that we had to rewrite anything necessary 
 for the rendering, hence the rendering has not reused much code.

 Paul


 Le 4 juil. 2012 à 09:54, Amit Nithian a écrit :

 Hello all,

 I am curious to know how people are using Solr in conjunction with
 other data stores when building search engines to power web sites (say
 an ecommerce site). The question I have for the group is given an
 architecture where the primary (transactional) data store is MySQL
 (Oracle, PostGres whatever) with periodic indexing into Solr, when
 your front end issues a search query to Solr and returns results, are
 there any joins with your primary Oracle/MySQL etc to help render
 results?

 Basically I guess my question is whether or not you store enough in
 Solr so that when your front end renders the results page, it never
 has to hit the database. The other option is that your search engine
 only returns primary keys that your front end then uses to hit the DB
 to fetch data to display to your end user.

 With Solr 4.0 and Solr moving towards the NoSQL direction, I am
 curious what people are doing and what application architectures with
 Solr look like.

 Thanks!
 Amit



Re: Something like 'bf' or 'bq' with MoreLikeThis

2012-07-03 Thread Amit Nithian
I had a similar problem so I submitted this patch:
https://issues.apache.org/jira/browse/SOLR-2351

I haven't applied this to trunk in a while but my goal was to ensure
that bf parameters were passed down and respected by the MLT handler.
Let me know if this works for you or not. If there is sufficient
interest, I'll re-apply this patch to trunk and try and devise some
tests.

Thanks!
Amit

On Tue, Jul 3, 2012 at 5:08 PM, nanshi nanshi.e...@gmail.com wrote:
 Jack, can you please explain this in some more detail? Such as how to write
 my own search component to modify request to add bq parameter and get
 customized result back?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Something-like-bf-or-bq-with-MoreLikeThis-tp3989060p3992888.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: difference between stored=false and stored=true ?

2012-07-03 Thread Amit Nithian
So couple questions on this (comment first then question):
1) I guess you can't have four combinations b/c
index=false/stored=false has no meaning?
2) If you set less fields stored=true does this reduce the memory
footprint for the document cache? Or better yet, I can store more
documents in the cache possibly increasing my cache efficiency?

I read about the lazy loading of fields which seems like a good way to
maximize the cache and gain the advantage of storing data in Solr too.
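
(For reference, that's the enableLazyFieldLoading flag in the <query> section
of solrconfig.xml:)

  <enableLazyFieldLoading>true</enableLazyFieldLoading>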

Thanks
Amit

On Sat, Jun 30, 2012 at 11:01 AM, Giovanni Gherdovich
g.gherdov...@gmail.com wrote:
 Thank you François and Jack for those explainations.

 Cheers,
 GGhh

 2012/6/30 François Schiettecatte:
 Giovanni

 stored=true means the data is stored in the index and [...]


 2012/6/30 Jack Krupansky:
 indexed and stored are independent [...]


Re: Editing long Solr URLs - Chrome Extension

2012-05-19 Thread Amit Nithian
All,

I have placed a new version of the extension (suffixed _0.3) at
https://github.com/ANithian/url_edit_extension/downloads. A few of the
bugs resolved:
1) Switching to a tab in a new window and clicking on the extension
wasn't loading the right URL
2) Complex SOLR URLs (ironic as this was the purpose) weren't being
handled properly. I had to ditch the 3rd party URL parser in favor of
my own which should better handle these complex parameters.
3) Replaced the edit box of the parameter value from a single line
textbox to a multiple line textarea. This doesn't solve the tab to
edit the next row but it helps a bit in that problem.

Please keep submitting issues as you encounter them and I'll address
them as best as possible. I hope that this helps everyone!

Thanks!
Amit

On Tue, May 15, 2012 at 6:20 PM, Amit Nithian anith...@gmail.com wrote:
 Erick

 Yes thanks I did see that and am working on a solution to that already. Hope
 to post a new revision shortly and eventually migrate to the extension
 store.

 Cheers
 Amit

 On May 15, 2012 9:20 AM, Erick Erickson erickerick...@gmail.com wrote:

 I think I put one up already, but in case I messed up github, complex
 params like the fq here:

 http://localhost:8983/solr/select?q=*:*&fq={!geofilt sfield=store pt=52.67,7.30 d=5}

 aren't properly handled.

 But I'm already using it occasionally

 Erick

 On Tue, May 15, 2012 at 10:02 AM, Amit Nithian anith...@gmail.com wrote:
  Jan
 
  Thanks for your feedback! If possible can you file these requests on the
  github page for the extension so I can work on them? They sound like
  great
  ideas and I'll try to incorporate all of them in future releases.
 
  Thanks
  Amit
  On May 11, 2012 9:57 AM, Jan Høydahl j...@hoydahl.no wrote:
 
  I've been testing
 
  https://chrome.google.com/webstore/detail/mbnigpeabbgkmbcbhkkbnlidcobbapff?hl=en
  but I don't think it's great.
 
  Great work on this one. Simple and straight forward. A few wishes:
  * Sticky mode? This tool would make sense in a sidebar, to do rapid
  refinements
  * If you edit a value and click TAB, it is not updated :(
  * It should not be necessary to URLencode all non-ascii chars - why not
  leave colon, caret (^) etc as is, for better readability?
  * Some param values in Solr may be large, such as fl, qf or bf.
  Would be nice if the edit box was multi-line, or perhaps adjusts to the
  size of the content
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.facebook.com/Cominvent
  Solr Training - www.solrtraining.com
 
  On 11. mai 2012, at 07:32, Amit Nithian wrote:
 
   Hey all,
  
   I don't know about you but most of the Solr URLs I issue are fairly
   lengthy full of parameters on the query string and browser location
   bars aren't long enough/have multi-line capabilities. I tried to find
   something that does this but couldn't so I wrote a chrome extension
   to
   help.
  
   Please check out my blog post on the subject and please let me know
   if
   something doesn't work or needs improvement. Of course this can work
   for any URL with a query string but my motivation was to help edit my
   long Solr URLs.
  
  
 
  http://hokiesuns.blogspot.com/2012/05/manipulating-urls-with-long-query.html
  
   Thanks!
   Amit
 
 


Re: Editing long Solr URLs - Chrome Extension

2012-05-15 Thread Amit Nithian
Jan

Thanks for your feedback! If possible can you file these requests on the
github page for the extension so I can work on them? They sound like great
ideas and I'll try to incorporate all of them in future releases.

Thanks
Amit
On May 11, 2012 9:57 AM, Jan Høydahl j...@hoydahl.no wrote:

 I've been testing
 https://chrome.google.com/webstore/detail/mbnigpeabbgkmbcbhkkbnlidcobbapff?hl=en but
  I don't think it's great.

 Great work on this one. Simple and straight forward. A few wishes:
 * Sticky mode? This tool would make sense in a sidebar, to do rapid
 refinements
 * If you edit a value and click TAB, it is not updated :(
 * It should not be necessary to URLencode all non-ascii chars - why not
 leave colon, caret (^) etc as is, for better readability?
 * Some param values in Solr may be large, such as fl, qf or bf.
 Would be nice if the edit box was multi-line, or perhaps adjusts to the
 size of the content

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.facebook.com/Cominvent
 Solr Training - www.solrtraining.com

 On 11. mai 2012, at 07:32, Amit Nithian wrote:

  Hey all,
 
  I don't know about you but most of the Solr URLs I issue are fairly
  lengthy full of parameters on the query string and browser location
  bars aren't long enough/have multi-line capabilities. I tried to find
  something that does this but couldn't so I wrote a chrome extension to
  help.
 
  Please check out my blog post on the subject and please let me know if
  something doesn't work or needs improvement. Of course this can work
  for any URL with a query string but my motivation was to help edit my
  long Solr URLs.
 
 
 http://hokiesuns.blogspot.com/2012/05/manipulating-urls-with-long-query.html
 
  Thanks!
  Amit




Re: Editing long Solr URLs - Chrome Extension

2012-05-15 Thread Amit Nithian
Erick

Yes thanks I did see that and am working on a solution to that already.
Hope to post a new revision shortly and eventually migrate to the extension
store.

Cheers
Amit
On May 15, 2012 9:20 AM, Erick Erickson erickerick...@gmail.com wrote:

 I think I put one up already, but in case I messed up github, complex
 params like the fq here:

 http://localhost:8983/solr/select?q=*:*&fq={!geofilt sfield=store pt=52.67,7.30 d=5}

 aren't properly handled.

 But I'm already using it occasionally

 Erick

 On Tue, May 15, 2012 at 10:02 AM, Amit Nithian anith...@gmail.com wrote:
  Jan
 
  Thanks for your feedback! If possible can you file these requests on the
  github page for the extension so I can work on them? They sound like
 great
  ideas and I'll try to incorporate all of them in future releases.
 
  Thanks
  Amit
  On May 11, 2012 9:57 AM, Jan Høydahl j...@hoydahl.no wrote:
 
  I've been testing
 
 https://chrome.google.com/webstore/detail/mbnigpeabbgkmbcbhkkbnlidcobbapff?hl=en but I
  don't think it's great.
 
  Great work on this one. Simple and straight forward. A few wishes:
  * Sticky mode? This tool would make sense in a sidebar, to do rapid
  refinements
  * If you edit a value and click TAB, it is not updated :(
  * It should not be necessary to URLencode all non-ascii chars - why not
  leave colon, caret (^) etc as is, for better readability?
  * Some param values in Solr may be large, such as fl, qf or bf.
  Would be nice if the edit box was multi-line, or perhaps adjusts to the
  size of the content
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.facebook.com/Cominvent
  Solr Training - www.solrtraining.com
 
  On 11. mai 2012, at 07:32, Amit Nithian wrote:
 
   Hey all,
  
   I don't know about you but most of the Solr URLs I issue are fairly
   lengthy full of parameters on the query string and browser location
   bars aren't long enough/have multi-line capabilities. I tried to find
   something that does this but couldn't so I wrote a chrome extension to
   help.
  
   Please check out my blog post on the subject and please let me know if
   something doesn't work or needs improvement. Of course this can work
   for any URL with a query string but my motivation was to help edit my
   long Solr URLs.
  
  
 
 http://hokiesuns.blogspot.com/2012/05/manipulating-urls-with-long-query.html
  
   Thanks!
   Amit
 
 



Editing long Solr URLs - Chrome Extension

2012-05-10 Thread Amit Nithian
Hey all,

I don't know about you, but most of the Solr URLs I issue are fairly
lengthy, full of parameters on the query string, and browser location
bars aren't long enough and don't support multiple lines. I tried to find
something that does this but couldn't, so I wrote a Chrome extension to
help.

Please check out my blog post on the subject and please let me know if
something doesn't work or needs improvement. Of course this can work
for any URL with a query string but my motivation was to help edit my
long Solr URLs.

http://hokiesuns.blogspot.com/2012/05/manipulating-urls-with-long-query.html

Thanks!
Amit


Re: Solr like for autocomplete field?

2010-11-03 Thread Amit Nithian
I implemented the edge n-grams solution, and it's a better approach than
any other I could think of, because I can index more than just text:
other metadata can be used to *rank* the autocomplete results,
eventually ranking by the probability of selection, which is, after
all, what you want to maximize with such systems.
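
In case it helps, a minimal sketch of the schema.xml field type I mean
(the field/type names and gram sizes are just illustrative; tune them for
your data):

  <!-- index-time edge n-grams, plain lowercased tokens at query time -->
  <fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="city_auto" type="text_autocomplete" indexed="true" stored="true"/>

The prefix the user has typed then matches the indexed grams directly, and
you can boost on whatever metadata fields you index alongside to order the
suggestions.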


On Tue, Nov 2, 2010 at 6:30 PM, Lance Norskog goks...@gmail.com wrote:
 And the SpellingComponent.

 There's nothing to help you with phrases.

 On Tue, Nov 2, 2010 at 11:21 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
 Also, you might want to consider TermsComponent, see:

 http://wiki.apache.org/solr/TermsComponent

 Also, note that there's an autosuggestcomponent, that's recently been
 committed.

 Best
 Erick

 On Tue, Nov 2, 2010 at 1:56 PM, PeterKerk vettepa...@hotmail.com wrote:


 I have a city field. Now when a user starts typing in a city textbox I want
 to return found matches (like Google).

 So for example, user types new, and I will return new york, new
 hampshire etc.

 my schema.xml

  <field name="city" type="string" indexed="true" stored="true"/>

 my current url:


  http://localhost:8983/solr/db/select/?indent=on&facet=true&q=*:*&start=0&rows=25&fl=id&facet.field=city&fq=city:new


 Basically 2 questions here:
 1. is the url Im using the best practice when implementing autocomplete?
 What I wanted to do, is use the facets for found matches.
 2. How can I match PART of the cityname just like the SQL LIKE command,
 cityname LIKE '%userinput'


 Thanks!
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-like-for-autocomplete-field-tp1829480p1829480.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 Lance Norskog
 goks...@gmail.com



Re: CoreContainer Usage

2010-10-11 Thread Amit Nithian
Hi, sorry, perhaps my question wasn't very clear. Basically I am trying
to build a federated search where I blend together the results of
queries against multiple cores. This is like distributed search, but I
believe distributed search issues network calls, which I would like to
avoid.

I have read that some people use a single core as the federated search
handler, run the searches across multiple cores, and blend the results.
This is great, but I can't figure out how to easily get access to the
CoreContainer instance that I hope has already been initialized (so
that I am not having it re-parse the configuration files).
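
To make it concrete, this is roughly the kind of access I'm hoping is
possible from inside a custom handler (a rough sketch only; the core name
and the blending step are placeholders, and "req" is assumed to be the
SolrQueryRequest in scope):

  // imports (roughly): org.apache.solr.core.SolrCore, org.apache.solr.core.CoreContainer,
  // org.apache.solr.search.SolrIndexSearcher, org.apache.solr.util.RefCounted
  SolrCore thisCore = req.getCore();
  CoreContainer container = thisCore.getCoreDescriptor().getCoreContainer();
  SolrCore otherCore = container.getCore("othercore");  // bumps the other core's ref count
  try {
    RefCounted<SolrIndexSearcher> searcher = otherCore.getSearcher();
    try {
      // run the query against the other core's searcher and blend its results with ours
    } finally {
      searcher.decref();
    }
  } finally {
    otherCore.close();  // releases the reference taken by getCore()
  }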

Any help would be appreciated.

Thanks!
Amit

On Thu, Oct 7, 2010 at 10:07 AM, Amit Nithian anith...@gmail.com wrote:
 I am trying to understand the multicore setup of Solr more and saw
 that SolrCore.getCore is deprecated in favor of
 CoreContainer.getCore(name). How can I get a reference to the
 CoreContainer for I assume it's been created somewhere in Solr and is
 it possible for one core to get access to another SolrCore via the
 CoreContainer?

 Thanks
 Amit



CoreContainer Usage

2010-10-07 Thread Amit Nithian
I am trying to understand the multicore setup of Solr better, and saw
that SolrCore.getCore is deprecated in favor of
CoreContainer.getCore(name). How can I get a reference to the
CoreContainer? I assume it has already been created somewhere inside
Solr. Also, is it possible for one core to get access to another
SolrCore via the CoreContainer?

Thanks
Amit


Re: Very slow queries

2010-10-07 Thread Amit Nithian
Try stopping replication and see whether your query performance
improves. I think the caches get reset each time replication opens a new
searcher. You can look at cache performance in the admin console; check
whether any of the caches are constantly being missed. That could mean
your newSearcher/firstSearcher warming queries aren't doing an adequate
job of warming the caches, which can hurt performance. Perhaps the
answer is to allocate more cache space, and hence more JVM heap space.
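
If it helps, a bare-bones sketch of the warming hook I mean in
solrconfig.xml (the queries are placeholders; use ones representative of
your real traffic, including your common sorts, filters and facets):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="sort">id asc</str></lst>
      <!-- add queries that exercise your common filters, sorts and facets -->
    </arr>
  </listener>

There is a matching firstSearcher listener for warming a cold searcher at
startup.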

I hope that this helps some.

- Amit

On Thu, Oct 7, 2010 at 4:32 AM, Christos Constantinou
ch...@simpleweb.co.uk wrote:
 Hello everyone,

 All of a sudden, I am experiencing some very slow queries with solr. I have 
 13GB of indexed documents, each averaging 50-100kb. They have an id key, so I 
 expect to be getting results really fast if I execute 
 id:7cd6cb99fd239c1d743a51bb85a48f790f4a6d3c as the query with no other 
 parameters. Instead, the query may take up to 1 full second (the majority of 
 time spent on org.apache.solr.handler.component.QueryComponent) whereas more 
 complicated queries may take more than a full minute to complete.

 I am not sure where to start looking for the problem. I stopped all the 
 scripts that add and commit the solr server, then restarted solr, but the 
 queries still take just as long.

 Also there is a replication server that runs every 60 seconds, I don't know 
 how that might affect performance.

 Any clues as to how I should investigate this would be appreciated.

 Thanks

 Christos



