FunctionQuery of FloatFieldSource (Lucene 5.0)

2015-07-14 Thread Peyman Faratin
Hi

I am having problems accessing float values in a Lucene 5.0 index via a
FunctionQuery.

My setup is as follows

Indexing time
--

Document doc = new Document();

FieldType f = new FieldType();
f.setStored(false);
f.setNumericType(NumericType.FLOAT);
f.setDocValuesType(DocValuesType.NUMERIC);
f.setNumericPrecisionStep(4);
f.setIndexOptions(IndexOptions.DOCS);

for (Entry<Integer, Float> component : vector.entrySet()) {
    String w = component.getKey().toString();
    Float score = component.getValue();
    doc.add(new FloatField(w, score, f));
}

writer.addDocument(doc);

At end of indexing I do

writer.forceMerge(1);

writer.close();
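
One workaround I am considering (a guess on my part, not something I have
verified): add the per-term doc values explicitly with FloatDocValuesField,
which encodes the float via Float.floatToRawIntBits, rather than relying on
the FloatField's FieldType to populate the NUMERIC doc values:

// Sketch only: index the score both as a FloatField (for NumericRangeQuery)
// and as an explicit FloatDocValuesField (for FloatFieldSource).
for (Entry<Integer, Float> component : vector.entrySet()) {
    String w = component.getKey().toString();
    float score = component.getValue();
    doc.add(new FloatField(w, score, Field.Store.NO));   // indexed numeric
    doc.add(new FloatDocValuesField(w, score));          // numeric doc values
}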


Search Time

--

for (Entry<Integer, Float> vector : vectors.entrySet()) {

    String w = vector.getKey().toString();
    Float score = vector.getValue();

    Query tq = NumericRangeQuery.newFloatRange(w, 0.0f, 100.0f, true, true);
    FunctionQuery fq = new FunctionQuery(new FloatFieldSource(w));
    CustomScoreQuery customQ = new My_CustomScorerQuery(tq, fq, score);

    TopDocs topDocs = indexSearcher.search(customQ, 1);
}

where My_CustomScorerQuery() is defined as follows:

public class My_CustomScorerQuery extends CustomScoreQuery {

    private final Float mainQueryScore;

    public My_CustomScorerQuery(Query mainQuery, FunctionQuery valSrcQuery,
                                Float mainQueryScore) {
        super(mainQuery, valSrcQuery);
        this.mainQueryScore = mainQueryScore;
    }

    @Override
    public CustomScoreProvider getCustomScoreProvider(LeafReaderContext r) {
        return new My_CustomScorer(r);
    }

    private class My_CustomScorer extends CustomScoreProvider {

        public My_CustomScorer(LeafReaderContext context) {
            super(context);
        }

        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
            System.out.println("\thit lucene docID: " + doc +
                    "\n\tquery score: " + mainQueryScore +
                    "\n\tsubQueryScore: " + subQueryScore +
                    "\n\tvalSrcScore: " + valSrcScore);
            return (float) (mainQueryScore * valSrcScore);
        }
    }
}

The problem I am seeing is that `valSrcScore` is always 0, and results
sometimes disappear entirely if I raise setNumericPrecisionStep above 4. I
am indexing the following 2 docs:

Map<Integer, Float> doc1 = new LinkedHashMap<Integer, Float>();
doc1.put(12, 0.5f);
doc1.put(18, 0.4f);
doc1.put(10, 0.1f);

indexer.indexVector(doc1, doc1);

Map<Integer, Float> doc2 = new LinkedHashMap<Integer, Float>();
doc2.put(10, 0.9f);
doc2.put(1, 0.8f);
doc2.put(9, 0.2f);
doc2.put(2, 0.1f);


and testing with the following query:

Map<Integer, Float> query = new LinkedHashMap<Integer, Float>();
query.put(10, 0.8f);
query.put(9, 0.6f);
query.put(2, 0.01f);

So field `10` in the query should have the following total scores for the
two documents in the index

score(query,doc0) = 0.8*0.1

score(query,doc1) = 0.8*0.9


but I only see

score(query,doc0) = 0.8*0.0

score(query,doc1) = 0.8*0.0


i.e. FloatFieldSource is always returning 0. If I subclass FloatFieldSource
and access

NumericDocValues arr = DocValues.getNumeric(readerContext.reader(), field);

the value reported for doc0 is 0, which _seems_ to suggest the index does
not actually contain the doc values? I can see the docs fine in Luke. There
is one subtle nuance, related to the way I am indexing the fields: not every
field is present in every doc.
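
For reference, this is roughly the per-segment check I am doing (a sketch;
it assumes a LeafReaderContext called readerContext and decodes the raw bits
the same way I believe FloatFieldSource does):

// Sketch of the diagnostic: read the NUMERIC doc values for one field and
// decode them from raw int bits back to a float.
NumericDocValues arr = DocValues.getNumeric(readerContext.reader(), field);
for (int docID = 0; docID < readerContext.reader().maxDoc(); docID++) {
    long bits = arr.get(docID);
    System.out.println("doc " + docID + " raw=" + bits
            + " asFloat=" + Float.intBitsToFloat((int) bits));
}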


Any pointers would be much appreciated


Peyman


custom search component on solrcloud

2015-04-15 Thread Peyman Faratin
Hi

I am trying to port my non-SolrCloud custom search handler to a SolrCloud one. 
I have read the WritingDistributedSearchComponents wiki page and looked at the 
TermsComponent and QueryComponent code, but the control flow of execution is 
still fuzzy (even given the “distributed algorithm” description). 

Concretely, I have a non-SolrCloud algorithm that, given a sequence of tokens 
T, would 

1- split T into single tokens
2- for each token t_i, get the DocList for t_i by calling 
rb.req.getSearcher().getDocList() in the process() method of the custom 
search component
3- do some magic on the collection of DocLists

My question is how can I 

1) do the splitting (step 1 above) in a single shard, and
2) distribute the getDocList for each token t_i to all shards
3) wait till I have all the DocLists from all shards, then
4) do something with the results, in the original calling shard (the one that 
did step 1) -- roughly along the lines of the skeleton below. 
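
For concreteness, this is the skeleton I think I am supposed to fill in, 
modelled loosely on TermsComponent. It is only a sketch: the class name, the 
purpose constant and the empty method bodies are my own placeholders, not a 
working implementation.

import java.io.IOException;

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.handler.component.ShardRequest;
import org.apache.solr.handler.component.ShardResponse;

public class TokenDocListComponent extends SearchComponent {

    // hypothetical purpose flag for the custom shard request
    private static final int PURPOSE_GET_TOKEN_DOCLISTS = 0x4000;

    @Override
    public void prepare(ResponseBuilder rb) throws IOException { }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // runs on each shard for the fanned-out request:
        // step 2, per-token getDocList against the local searcher
    }

    @Override
    public int distributedProcess(ResponseBuilder rb) throws IOException {
        // runs on the shard coordinating the request:
        // step 1 (splitting), then fan one sub-request out to all shards
        ShardRequest sreq = new ShardRequest();
        sreq.purpose = PURPOSE_GET_TOKEN_DOCLISTS;
        sreq.params = new ModifiableSolrParams(rb.req.getParams());
        rb.addRequest(this, sreq);
        return ResponseBuilder.STAGE_DONE;
    }

    @Override
    public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
        if ((sreq.purpose & PURPOSE_GET_TOKEN_DOCLISTS) == 0) return;
        for (ShardResponse rsp : sreq.responses) {
            // step 3: gather each shard's per-token results
        }
    }

    @Override
    public void finishStage(ResponseBuilder rb) {
        // step 4: merge the gathered results and add them to rb.rsp
    }

    @Override
    public String getDescription() { return "per-token DocList gathering"; }

    @Override
    public String getSource() { return null; }
}

Is that the right division of labour between these hooks?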

Thank you for your help

commons-configuration NoClassDefFoundError: Predicate

2014-07-23 Thread Peyman Faratin
Hi

I've tried all permutations with no results, so I thought I'd write to the 
group for help. 

I am running commons-configuration 
(http://commons.apache.org/proper/commons-configuration/) just fine via maven 
and ant, but when I run the class that calls PropertiesConfiguration from a 
Solr search component I get the following error

org.eclipse.jetty.servlet.ServletHandler – Error for /solr/ArticlesRaw/ingest
java.lang.NoClassDefFoundError: org/apache/commons/collections/Predicate
    at com.xyz.logic(Ingest.java:106)
    at com.xyz.logic.process(Runngest.java:76)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:217)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:533)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections.Predicate
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:430)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)


Following suggestions here

http://stackoverflow.com/questions/7651799/proper-usage-of-apache-commons-configuration/7651867#7651867

I am including the appropriate jars in solrconfig.xml 

  <lib dir="${mvnRepository}/commons-lang/commons-lang/2.6/" regex=".*\.jar" />
  <lib dir="${mvnRepository}/commons-collections/commons-collections/3.2.1/" regex=".*\.jar" />
  <lib dir="${mvnRepository}/commons-logging/commons-logging/1.1.1/" regex=".*\.jar" />
  <lib dir="${mvnRepository}/commons-configuration/commons-configuration/1.10/" regex=".*\.jar" />

(the class org.apache.commons.collections.Predicate is in the 
commons-collections 3.2.1 jar)

I am running solr 4.7.1

Any help would be much appreciated

Peyman




Deleting and committing inside a SearchComponent

2013-12-03 Thread Peyman Faratin
Hi

Is it possible to delete and commit updates to an index inside a custom 
SearchComponent? I know I can do it with solrj but due to several business 
logic requirements I need to build the logic inside the search component.  I am 
using SOLR 4.5.0. 

thank you

Re: Deleting and committing inside a SearchComponent

2013-12-03 Thread Peyman Faratin

On Dec 3, 2013, at 8:41 PM, Upayavira u...@odoko.co.uk wrote:

 
 
 On Tue, Dec 3, 2013, at 03:22 PM, Peyman Faratin wrote:
 Hi
 
 Is it possible to delete and commit updates to an index inside a custom
 SearchComponent? I know I can do it with solrj but due to several
 business logic requirements I need to build the logic inside the search
 component.  I am using SOLR 4.5.0. 
 
 That just doesn't make sense. Search components are read only.
 
I can think of many situations where it makes sense. For instance, you search 
for a document and your index contains many duplicates that only differ by one 
field, such as the time they were indexed (think news feeds from multiple 
sources). So after the search we want to delete the duplicate documents that 
satisfy some policy (here date, but it could be some other policy). 

 What are you trying to do? What stuff do you need to change? Could you
 do it within an UpdateProcessor?

The solution I am working with: 

UpdateRequestProcessorChain processorChain = 
    rb.req.getCore().getUpdateProcessingChain(rb.req.getParams().get(UpdateParams.UPDATE_CHAIN));
UpdateRequestProcessor processor = processorChain.createProcessor(rb.req, rb.rsp);
...
docId = f();
...
DeleteUpdateCommand cmd = new DeleteUpdateCommand(rb.req);
cmd.setId(docId.toString());
processor.processDelete(cmd);


 
 Upayavira



deleting a doc inside a custom UpdateRequestProcessor

2013-11-18 Thread Peyman Faratin
Hi

I am building a custom UpdateRequestProcessor to intercept any doc heading to 
the index. Basically what I want to do is check whether the current index has a 
doc with the same title (I am using IDs as the unique key so I can't use that, 
and besides, the checking logic is a little more complicated). If the incoming 
doc has a duplicate and some other conditions hold, then one of 2 things can 
happen:

1- we don't index the incoming document
2- we index the incoming and delete the duplicate currently in the index

I think (1) can be done by simply not passing the call up the chain (not 
calling super.processAdd(cmd)). However, I don't know how to implement the 
second case, deleting the duplicate document, inside a custom 
UpdateRequestProcessor. This thread is the closest to my goal 
http://lucene.472066.n3.nabble.com/SOLR-4-3-0-Migration-How-to-use-DeleteUpdateCommand-td4062454.html

however I am not clear how to proceed. Code snippets below.

thank you in advance for your help

class isDuplicate extends UpdateRequestProcessor 
{
    public isDuplicate(UpdateRequestProcessor next) { 
        super(next); 
    } 

    @Override 
    public void processAdd(AddUpdateCommand cmd) throws IOException 
    {   
        try 
        {
            boolean indexIncomingDoc = checkIfIsDuplicate(cmd); 
            if (indexIncomingDoc)
                super.processAdd(cmd);  
        } catch (SolrServerException e) { e.printStackTrace(); } 
        catch (ParseException e) { e.printStackTrace(); }
    } 

    public boolean checkIfIsDuplicate(AddUpdateCommand cmd) ... {
        SolrInputDocument incomingDoc = cmd.getSolrInputDocument();
        if (incomingDoc == null) return false;
        String title = (String) incomingDoc.getFieldValue("title");
        SolrIndexSearcher searcher = cmd.getReq().getSearcher();
        boolean addIncomingDoc = true;
        Integer idOfDuplicate = searcher.getFirstMatch(new Term("title", title));
        if (idOfDuplicate != -1) 
        {
            addIncomingDoc = compareDocs(searcher, incomingDoc, idOfDuplicate, title, addIncomingDoc);
        }
        return addIncomingDoc;  
    }

    private boolean compareDocs(...) { 
        if ( condition 1 ) 
        {
            -- DELETE DUPLICATE DOC in INDEX --
            addIncomingDoc = true;
        }
        return addIncomingDoc;
    }
}
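
For completeness, this is the kind of thing I am imagining for the delete step 
inside compareDocs (a sketch only, modelled on DeleteUpdateCommand; I have not 
verified that issuing a delete from inside processAdd is safe, and the "id" 
uniqueKey and helper name are mine):

// Hypothetical sketch: delete the existing duplicate by its unique key before
// letting the incoming doc through the rest of the chain.
private void deleteDuplicate(AddUpdateCommand addCmd, String duplicateId) throws IOException {
    DeleteUpdateCommand del = new DeleteUpdateCommand(addCmd.getReq());
    del.setId(duplicateId);          // delete-by-id on the uniqueKey field
    super.processDelete(del);        // pass it down the same processor chain
}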

Re: subindex

2013-09-08 Thread Peyman Faratin
Hi Erick

it makes sense. Thank you for this. 

peyman

On Sep 5, 2013, at 4:11 PM, Erick Erickson erickerick...@gmail.com wrote:

 Nope. You can do this if you've stored _all_ the fields (with the exception
 of
 _version_ and the destinations of copyField directives). But there's no way
 I
 know of to do what you want if you haven't.
 
 If you have, you'd be essentially spinning through all your docs and
 re-indexing
 just the fields you cared about. But if you still have access to your
 original
 docs this would be slower/more complicated than just re-indexing from
 scratch.
 
 Best
 Erick
 
 
 On Wed, Sep 4, 2013 at 1:51 PM, Peyman Faratin pey...@robustlinks.comwrote:
 
 Hi
 
 Is there a way to build a new (smaller) index from an existing (larger)
 index where the smaller index contains a subset of the fields of the larger
 index?
 
 thank you



subindex

2013-09-04 Thread Peyman Faratin
Hi

Is there a way to build a new (smaller) index from an existing (larger) index 
where the smaller index contains a subset of the fields of the larger index? 

thank you

Re: State sharing

2013-08-20 Thread Peyman Faratin
got it. thank you Jack and Shalin

On Aug 19, 2013, at 9:52 AM, Jack Krupansky j...@basetechnology.com wrote:

 Generally, you shouldn't be trying to maintain, let alone share state in 
 Solr itself. It sounds like you need an application layer between your 
 application clients and Solr which could then maintain whatever state it 
 needs.
 
 -- Jack Krupansky
 
 -Original Message- From: Peyman Faratin
 Sent: Saturday, August 17, 2013 12:29 PM
 To: solr-user@lucene.apache.org
 Subject: State sharing
 
 Hi
 
 I have subclassed a SearchComponent (call this class S), and would like to 
 implement the following transaction logic:
 
 1- Client K calls the S's handler
 
 2- S spawns a thread and immediately acks K using 
 rb.rsp.add("status","complete") then terminates
 
 public void process (ResponseBuilder rb)
 {
 SolrParams params = rb.req.getParams();
 
 try{
 ExecutorService executorService = Executors.newCachedThreadPool();
 
 Processor job = new  Processor(rb);
 
 executorService.submit(job);
 rb.rsp.add("status","complete");
 
 }catch(Exception e) {e.printStackTrace();};
 }
 
 3- The thread S started (job above) does two chunks of logic in serial 
 (call these B and C):
 
 i) B does some processing and sends client K a series of status updates, then
 ii) C does some processing and in turn sends K series of status updates then 
 one final complete message
 iii)  transaction ends
 
 I am using SOLR 4.3.1. How can I support such a transaction in solr? I've 
 tried sharing S's ResponseBuilder with the thread but presumably because S 
 terminates in step 2 K will never see the response from B and C. In general I 
 would like to implement a mechanism that can share processing state with the 
 client in the same http session.
 
 thank you for your help
 
 Peyman
 
 



State sharing

2013-08-17 Thread Peyman Faratin
Hi

I have subclassed a SearchComponent (call this class S), and would like to 
implement the following transaction logic:

1- Client K calls the S's handler

2- S spawns a thread and immediately acks K using 
rb.rsp.add("status","complete") then terminates

public void process (ResponseBuilder rb) 
{
SolrParams params = rb.req.getParams();

try{ 
ExecutorService executorService = 
Executors.newCachedThreadPool();

Processor job = new  Processor(rb);

executorService.submit(job);

rb.rsp.add("status","complete");

}catch(Exception e) {e.printStackTrace();};
}

3- The thread S started (job above) does two chunks of logic in serial (call 
these B and C):

i) B does some processing and sends client K a series of status 
updates, then
ii) C does some processing and in turn sends K a series of status updates 
then one final complete message
iii)  transaction ends

I am using SOLR 4.3.1. How can I support such a transaction in solr? I've tried 
sharing S's ResponseBuilder with the thread but presumably because S terminates 
in step 2 K will never see the response from B and C. In general I would like 
to implement a mechanism that can share processing state with the client in the 
same http session. 

thank you for your help 

Peyman




Re: cores sharing an instance

2013-06-30 Thread Peyman Faratin
I see. If I wanted to try the second option (find a place inside Solr before 
the core is created), then where would that place be in the flow of the app 
waking up? Currently each core loads its app caches via a requestHandler (in 
solrconfig.xml) that initializes the Java class that does the loading. For 
instance:

<requestHandler name="/cachedResources" class="solr.SearchHandler" startup="lazy">
   <arr name="last-components">
     <str>AppCaches</str>
   </arr>
</requestHandler>

<searchComponent name="AppCaches" class="com.name.Project.AppCaches"/>


So each core has its own core-specific cachedResources handler. Where in Solr 
would I need to place the AppCaches code to make it visible to all the other 
cores?

thank you Roman

On Jun 29, 2013, at 10:58 AM, Roman Chyla roman.ch...@gmail.com wrote:

 Cores can be reloaded, they are inside solrcore loader /I forgot the exact
 name/, and they will have different classloaders /that's servlet thing/, so
 if you want singletons you must load them outside of the core, using a
 parent classloader - in case of jetty, this means writing your own jetty
 initialization or config to force shared class loaders. or find a place
 inside the solr, before the core is created. Google for montysolr to see
 the example of the first approach.
 
 But, unless you really have no other choice, using singletons is IMHO a bad
 idea in this case
 
 Roman
 
 On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote:
 
 its the singleton pattern, where in my case i want an object (which is
 RAM expensive) to be a centralized coordinator of application logic.
 
 thank you
 
 On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com
 wrote:
 
 There is very little shared between multiple cores (instanceDir paths,
 logging config maybe?). Why are you trying to do this?
 
 On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com
 wrote:
 Hi
 
 I have a multicore setup (in 4.3.0). Is it possible for one core to
 share an instance of its class with other cores at run time? i.e.
 
 At run time core 1 makes an instance of object O_i
 
 core 1 -- object O_i
 core 2
 ---
 core n
 
 then can core K access O_i? I know they can share properties but is it
 possible to share objects?
 
 thank you
 
 
 
 
 --
 Regards,
 Shalin Shekhar Mangar.
 



Re: cores sharing an instance

2013-06-30 Thread Peyman Faratin
That is what I had assumed but it appears not to be the case. A class (and its 
properties) of one core is not visible to another class in another core - in 
the same JVM. 

Peyman

On Jun 29, 2013, at 1:23 PM, Erick Erickson erickerick...@gmail.com wrote:

 Well, the code is all in the same JVM, so there's no
 reason a singleton approach wouldn't work that I
 can think of. All the multithreaded caveats apply.
 
 Best
 Erick
 
 
 On Fri, Jun 28, 2013 at 3:44 PM, Peyman Faratin pey...@robustlinks.comwrote:
 
 Hi
 
 I have a multicore setup (in 4.3.0). Is it possible for one core to share
 an instance of its class with other cores at run time? i.e.
 
 At run time core 1 makes an instance of object O_i
 
 core 1 -- object O_i
 core 2
 ---
 core n
 
 then can core K access O_i? I know they can share properties but is it
 possible to share objects?
 
 thank you
 
 



Re: cores sharing an instance

2013-06-29 Thread Peyman Faratin
its the singleton pattern, where in my case i want an object (which is RAM 
expensive) to be a centralized coordinator of application logic. 

thank you

On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com 
wrote:

 There is very little shared between multiple cores (instanceDir paths,
 logging config maybe?). Why are you trying to do this?
 
 On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com 
 wrote:
 Hi
 
 I have a multicore setup (in 4.3.0). Is it possible for one core to share an 
 instance of its class with other cores at run time? i.e.
 
 At run time core 1 makes an instance of object O_i
 
 core 1 -- object O_i
 core 2
 ---
 core n
 
 then can core K access O_i? I know they can share properties but is it 
 possible to share objects?
 
 thank you
 
 
 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.



cores sharing an instance

2013-06-28 Thread Peyman Faratin
Hi 

I have a multicore setup (in 4.3.0). Is it possible for one core to share an 
instance of its class with other cores at run time? i.e.

At run time core 1 makes an instance of object O_i

core 1 -- object O_i
core 2
---
core n

then can core K access O_i? I know they can share properties but is it possible 
to share objects?

thank you



Upgrading from 3.6.1 to 4.3.0 and Custom collector

2013-06-17 Thread Peyman Faratin
Hi 

I am migrating from Lucene 3.6.1 to 4.3.0. I am however not sure how to migrate 
my custom collector below. This page 
http://lucene.apache.org/core/4_3_0/MIGRATE.html gives some hints, but the 
instructions are incomplete, and looking at the source of example custom 
collectors makes me want to go and eat cheesecake - every time !!!

Any advise would be very much appreciated 

thank you


public class AllInLinks extends Collector {
  private Scorer scorer;
  private int docBase;
  private String[] store;
  private HashSet<String> outLinks = new HashSet<String>();

  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  public void setScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    this.docBase = docBase;
    store = FieldCache.DEFAULT.getStrings(reader, "title");
  }

  public void collect(int doc) throws IOException {
    String page = store[doc];
    outLinks.add(page);
  }

  public void reset() {
    outLinks.clear();
    store = null;
  }

  public int getOutLinks() {
    return outLinks.size();
  }
}
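
For what it's worth, this is my rough, untested attempt at the 4.x shape of the 
same collector. The per-segment hook now takes an AtomicReaderContext, and the 
field values come from FieldCache as BinaryDocValues rather than a String[]; 
the exact FieldCache/BinaryDocValues signatures shifted during the 4.x series, 
so this needs checking against the 4.3 javadocs.

import java.io.IOException;
import java.util.HashSet;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.BytesRef;

public class AllInLinks extends Collector {
  private BinaryDocValues titles;
  private final HashSet<String> outLinks = new HashSet<String>();

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  @Override
  public void setScorer(Scorer scorer) {
    // scores are not used
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    // per-segment view; doc IDs passed to collect() are segment-relative
    titles = FieldCache.DEFAULT.getTerms(context.reader(), "title");
  }

  @Override
  public void collect(int doc) throws IOException {
    BytesRef ref = new BytesRef();
    titles.get(doc, ref);
    outLinks.add(ref.utf8ToString());
  }

  public void reset() {
    outLinks.clear();
  }

  public int getOutLinks() {
    return outLinks.size();
  }
}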



setting bq in searchcomponent

2013-03-21 Thread Peyman Faratin
Hi

If I run a main query cheeze jointly with a boost query bq=spell:cheeze 
(boosting results with spell field cheeze), as

/select?fl=titleqf=mainbq=spell:cheezebq=trans:cheezeq=cheeze

everything works fine. And defType=dismax

What I'd like to do is programmatically generate the bq query inside a 
custom search component's process method and issue a query similar to the 
above. I can achieve my goal by explicitly constructing and running a query 
as follows

StringBuilder QueryStr = new StringBuilder();
QueryStr.append("echoParams=none&");
QueryStr.append("debugQuery=off&");
QueryStr.append("defType=dismax&");
QueryStr.append("df=main&");
QueryStr.append("q=" + token + "&");
QueryStr.append("bq=spell:" + token + "&"); 
QueryStr.append("bq=trans:" + token); 
SolrParams query = SolrRequestParsers.parseQueryString(QueryStr.toString());

rb.req.setParams(query);
Query q = QParser.getParser(token, defType, rb.req).parse();
DocList hits = 
    searcher.getDocList(q, rb.getFilters(), Sort.RELEVANCE, offset, rows, fieldFlags);


But is there a way to set the request parameters directly? What would be the 
best way to do this?

public void process(ResponseBuilder rb) throws IOException {

    ...

    String token = rb.req.getParams().get("token");

    String bqfield = rb.req.getParams().get(DisMaxParams.BQ);   

    ...

    Query q = QParser.getParser(token, defType, rb.req).parse();

    DocList hits = 
        searcher.getDocList(q, rb.getFilters(), Sort.RELEVANCE, offset, rows, fieldFlags);

}


without having to explicitly construct the query string?
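
In other words, something along these lines is what I am after (a sketch only; 
I have not confirmed this is the intended way to inject a bq from a component):

// Sketch: wrap the incoming params, add the boost queries programmatically,
// and put the modified params back on the request before parsing with dismax.
ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
params.set("defType", "dismax");
params.add("bq", "spell:" + token);
params.add("bq", "trans:" + token);
rb.req.setParams(params);

Query q = QParser.getParser(token, "dismax", rb.req).parse();
DocList hits = searcher.getDocList(q, rb.getFilters(), Sort.RELEVANCE, offset, rows, fieldFlags);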

thank you 

Peyman

faceting and clustering on MLT via stream.body

2013-02-22 Thread Peyman Faratin
Hi

I would like to run an MLT search (in SolrJ) on a short piece of text delivered 
via stream.body. This part works. What I would like to be able to do is 2 
things:

- faceting on some number (not ALL) of the results
- cluster (using carrot2) all of the results

Is this possible? I believe faceting occurs on all of the docs returned 
(numFound), not on the requested number of results (rows). Is this correct?

thank you for your help

Peyman

recommended SSD

2012-08-23 Thread Peyman Faratin
Hi

Is there a SSD brand and spec that the community recommends for an index of 
size 56G with mostly reads? We are evaluating this one

http://www.newegg.com/Product/Product.aspx?Item=N82E16820227706

thank you

Peyman




synonym file

2012-08-02 Thread Peyman Faratin
Hi

I have a (23M) synonym file that takes a long time (3 or so minutes) to load 
and, once included, seems to adversely affect the QTime of the application by 
approximately 4 orders of magnitude. 

Any advice on how to load it faster and lower the QTime would be much appreciated. 

best

Peyman

Re: index writer in searchComponent

2012-07-01 Thread Peyman Faratin
Hi Dmitry
Which SolrJ API would I use to receive the user query? I was under the 
impression the request handler mechanism was the (RESTful) interface between 
the user query and the index(es). 
thank you
Peyman

On Jul 1, 2012, at 10:11 AM, Dmitry Kan wrote:

 Hi Peyman,
 
 Could you just use solrj api for this purpose? That is, ask via solrj api
 1-2 and perform 3 if entity (assuming you mean document or some field value
 by X) didn't exist, i.e. add it to the index.
 
 // Dmitry
 
 On Sun, Jul 1, 2012 at 6:03 AM, Peyman Faratin pey...@robustlinks.comwrote:
 
 Hi Erik
 
 The workflow I'd like to implement is
 
 1- search the index using the incoming query
 2- the query is of the type does entity X exist
 3- if X does not exist in the index then I'd like to add X to the index
 
 Currently I am using a custom search component to achieve this by creating
 a solrserver within the init (or inform) method of the search component and
 using that instance to update (and commit) the index. I am not sure this is
 the best approach either and thought using the IndexReader of the search
 component itself maybe better.
 
 Is there a better approach in your opinion?
 
 thank you Erik
 
 Peyman
 
 On Jun 30, 2012, at 8:13 PM, Erick Erickson wrote:
 
 Lots of the index modification (all of it?) has been removed in 4.0
 from IndexReaders...
 
 It seems like you could always get the directory and open a
 SolrIndexWriter wherever you wanted,
 but I'm not sure it's a good idea, are there other processes that will
 be writing to the index at the
 same time?
 
 What's the purpose here anyway? There might be a better approach
 
 Best
 Erick
 
 On Thu, Jun 28, 2012 at 4:02 PM, Peyman Faratin pey...@robustlinks.com
 wrote:
 Hi
 
 Is it possible to add a new document to the index in a custom
 SearchComponent (that also implements a SolrCoreAware)? I can get a
 reference to the indexReader via the ResponseBuilder parameter of the
 process() method using
 
 rb.req.getSearcher().getReader()
 
 But is it possible to actually add a new document to the index _after_
 searching the index? I.e accessing the indexWriter?
 
 thank you
 
 Peyman
 
 
 
 
 -- 
 Regards,
 
 Dmitry Kan



Re: index writer in searchComponent

2012-06-30 Thread Peyman Faratin
Hi Erik

The workflow I'd like to implement is 

1- search the index using the incoming query
2- the query is of the type does entity X exist
3- if X does not exist in the index then I'd like to add X to the index

Currently I am using a custom search component to achieve this by creating a 
SolrServer within the init (or inform) method of the search component and using 
that instance to update (and commit) the index. I am not sure this is the best 
approach either and thought using the IndexReader of the search component 
itself may be better. 

Is there a better approach in your opinion?

thank you Erik

Peyman

On Jun 30, 2012, at 8:13 PM, Erick Erickson wrote:

 Lots of the index modification (all of it?) has been removed in 4.0
 from IndexReaders...
 
 It seems like you could always get the directory and open a
 SolrIndexWriter wherever you wanted,
 but I'm not sure it's a good idea, are there other processes that will
 be writing to the index at the
 same time?
 
 What's the purpose here anyway? There might be a better approach
 
 Best
 Erick
 
 On Thu, Jun 28, 2012 at 4:02 PM, Peyman Faratin pey...@robustlinks.com 
 wrote:
 Hi
 
 Is it possible to add a new document to the index in a custom 
 SearchComponent (that also implements a SolrCoreAware)? I can get a 
 reference to the indexReader via the ResponseBuilder parameter of the 
 process() method using
 
 rb.req.getSearcher().getReader()
 
 But is it possible to actually add a new document to the index _after_ 
 searching the index? I.e accessing the indexWriter?
 
 thank you
 
 Peyman



index writer in searchComponent

2012-06-28 Thread Peyman Faratin
Hi

Is it possible to add a new document to the index in a custom SearchComponent 
(that also implements a SolrCoreAware)? I can get a reference to the 
indexReader via the ResponseBuilder parameter of the process() method using

rb.req.getSearcher().getReader()

But is it possible to actually add a new document to the index _after_ 
searching the index? I.e accessing the indexWriter?

thank you

Peyman

KeywordTokenizerFactory with SynonymFilterFactory

2012-06-16 Thread Peyman Faratin
Hi

I have the following 2 field types

<fieldType name="tokenizer1" class="solr.TextField" sortMissingLast="true" 
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
            ignoreCase="false" expand="true"/> 
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


<fieldType name="tokenizer2" class="solr.TextField" sortMissingLast="true" 
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
            ignoreCase="false" expand="true"/> 
  </analyzer>
</fieldType>

The problem I am seeing is that if I have an entry like this in the synonyms.txt file

helping hand => assistance

then issuing a "helping hand" query (with dismax) against the field tokenized 
with tokenizer1 returns the correct query (assistance), whereas there is no 
synonym mapping for tokenizer2 (confirmed in the Solr admin panel). 

Am I doing something wrong?

thank you




Re: KeywordTokenizerFactory with SynonymFilterFactory

2012-06-16 Thread Peyman Faratin
thank you Michael.

On Jun 16, 2012, at 6:40 PM, Michael Ryan wrote:

 Try changing the tokenizer2 SynonymFilterFactory filter to this:
 
 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
 ignoreCase="false" expand="true" 
 tokenizerFactory="solr.KeywordTokenizerFactory"/>
 
 By default, it seems that it uses WhitespaceTokenizer.
 
 -Michael



Kernel methods in SOLR

2012-04-23 Thread Peyman Faratin
Hi

Has there been any work that tries to integrate kernel methods [1] with Solr? I 
am interested in using kernel methods to solve synonymy, hyponymy and polysemy 
(disambiguation) problems which Solr's vector space model (bag of words) does 
not capture. 

For example, imagine we have only 3 words in our corpus: puma, cougar and 
feline. The 3 words obviously have interdependencies (puma disambiguates to 
cougar; cougar and puma are instances of felines - hyponyms). Now, imagine 2 
docs, d1 and d2, that have the following TF-IDF vectors. 

 puma, cougar, feline
d1   =   [  2,0, 0]
d2   =   [  0,1, 0]

i.e. d1 has no mention of the terms cougar or feline and, conversely, d2 has no 
mention of the terms puma or feline. Hence under the vector approach d1 and d2 are 
not related at all (and each interpretation of the terms has a unique vector), 
which is not what we want to conclude. 

What I need is to include a kernel matrix (as data) such as the following that 
captures these relationships:

   puma, cougar, feline
puma=   [  1,1, 0.4]
cougar  =   [  1,1, 0.4]
feline  =   [  0.4, 0.4, 1]

then recompute the TF-IDF vector as a product of (1) the original vector and 
(2) the kernel matrix, resulting in

 puma, cougar, feline
d1   =   [  2,2, 0.8]
d2   =   [  1,1, 0.4]

(note, the new vectors are much less sparse). 
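
For clarity, the recomputation is just a vector-matrix product, d' = d x K; a 
minimal sketch in plain Java (array layout and names are my own, for the toy 
example only):

// Minimal sketch of d' = d * K for the toy example above.
double[][] K = {
    {1.0, 1.0, 0.4},   // puma
    {1.0, 1.0, 0.4},   // cougar
    {0.4, 0.4, 1.0}    // feline
};
double[] d1 = {2, 0, 0};

double[] d1Prime = new double[K[0].length];
for (int j = 0; j < K[0].length; j++) {
    for (int i = 0; i < d1.length; i++) {
        d1Prime[j] += d1[i] * K[i][j];
    }
}
// d1Prime is now [2.0, 2.0, 0.8], matching the recomputed d1 above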

I can solve this problem (inefficiently) at the application layer, but I was 
wondering if there have been any attempts within the community to solve similar 
problems efficiently, without paying a hefty response-time price?

thank you 

Peyman

[1] http://en.wikipedia.org/wiki/Kernel_methods

custom field default qf of requestHandler

2012-04-03 Thread Peyman Faratin
Hi

I have a problem in the following context. 

I have a field with a custom type, shingledcontent, defined as follows in 
schema.xml

<field name="shingledContent" 
       type="shingledcontent" 
       compressed="true" 
       omitNorms="false" 
       termVectors="true" 
       termOffsets="true" 
       termPositions="true" 
       indexed="true" 
       stored="false" 
       multiValued="false"
       required="true" /> 

where

<fieldType name="shingledcontent" class="solr.TextField" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true" maxShingleSize="2"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" maxShingleSize="2"/>
  </analyzer>
</fieldType>

I then define a request handler as follows

<requestHandler name="/test" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="q.alt"></str> 
    <str name="fl">title,score</str> 
    <int name="start">0</int>
    <int name="rows">2000</int>
    <str name="echoParams">all</str>
    <str name="qf">titleAnalyzed^2.0 shingledContent^1.0 content^1.0</str>
  </lst>
  <lst name="appends">
    <str name="fq">someTest:false</str> 
    <str name="fq">anotherTest:false</str> 
  </lst>
  <arr name="last-components">
    <str>Test</str>
  </arr>
</requestHandler>

<searchComponent name="Test" class="com.a.b.c"/>
 
The problem I am seeing is that the shingledContent field never shows up in 
the parsed query - what I see is:

Q: +(content:dog | titleAnalyzed:dog^2.0) ()

(content and titleAnalyzed are both of type text_general, found in the 
default schema.xml). If I change the shingledContent field to be of type 
text_general it is correctly included in the query field. 

Is this correct behavior or am I making an error somewhere?

thank you







query score across ALL docs

2012-03-28 Thread Peyman Faratin
Hi

What is the best way to retrieve the score of a query across ALL documents in 
the index? i.e.

given:

1) docs,  [A,B,C,D,E,...M]  of M dimensions

2) Query q

searcher outputs (efficiently)

1) the score of q across _all_ M dimensional documents, ordered by index 
number. i.e

score(q) = [A=0.1, B=0.0, ..., M=0.76]

Currently the searcher outputs the top N matches, where (often) N < M for 
large indices.  My index is ~9MM docs. Using a custom collector will not 
work. 

Any advice would be much appreciated

Peyman




QueryHandler

2012-03-26 Thread Peyman Faratin
Hi

A noobie question. I am uncertain what is the best way to design for my 
requirement, which is the following.

I want to allow another client in SolrJ to query Solr with a query that is 
handled by a custom handler

localhost:9090/solr/tokenSearch?tokens={!dismax 
qf=content}pear,apples,oyster,king kong&fl=score&rows=1000

i.e. a list of tokens (single words and phrases) is sent in one HTTP call. 

What I would like to do is to search over each individual token and compose a 
single response back to the client

The current approach I have taken is to create a custom search handler as 
follows

<requestHandler name="/tokenSearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str> 
  </lst>
  <arr name="components">
    <str>myHandler</str>
  </arr>
</requestHandler>

<searchComponent name="myHandler" class="com.a.RequestHandlers.myHandler"/>
   
myHandler (which extends SearchComponent) overrides the prepare and process 
methods, extracting and iterating over each token in the input. The problem I 
am hitting in this design is that the prepare() method is passed a reference to 
the SolrIndexSearcher in the ResponseBuilder parameter (so, for efficiency 
reasons, I don't want to open up another server connection for the search). I 
can construct a Lucene query and search just fine, but what I would like to do 
instead is use the e/dismax queries (rather than construct my own - to reduce 
errors). The getDocList() method of SolrIndexSearcher, on the other hand, 
requires a Lucene query object. 

Is this an appropriate design for my requirement? And if so, what is the best 
way to send a SolrQuery to the SolrIndexSearcher - something like the sketch 
below?
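
To make the intent concrete, this is roughly the per-token loop I am imagining 
inside process() (a sketch only; the "tokens" parameter name, the dismax 
defType and the offset/rows values are mine, and exception handling is omitted):

// Sketch: parse each token with the dismax QParser so Solr builds the query,
// then run it against the searcher already available on the ResponseBuilder.
SolrIndexSearcher searcher = rb.req.getSearcher();
String[] tokens = rb.req.getParams().get("tokens").split(",");

for (String token : tokens) {
    Query q = QParser.getParser(token, "dismax", rb.req).parse();
    DocList hits = searcher.getDocList(q, rb.getFilters(), Sort.RELEVANCE,
                                       0, 1000, 0);
    // compose the per-token results into the single response here
}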

Thank you 

Peyman

Re: Faster Solr Indexing

2012-03-19 Thread Peyman Faratin
Hi Erick, Dimitry and Mikhail

thank you all for your time. I tried all of the suggestions below and am happy 
to report that indexing speeds have improved. There were several confounding 
problems including

- a bank of (~20) regexes that were poorly optimized and compiled at each 
indexing step
- single threaded
- not using StreamingUpdateSolrServer
- excessive logging

However, the biggest bottleneck was two Lucene searches (across ~9MM docs) at the 
time of building each Solr document. Indexing sped up after precomputing these 
values offline.

Thank you all for your help. 

best

Peyman 

On Mar 12, 2012, at 10:58 AM, Erick Erickson wrote:

 How have you determined that it's the solr add? By timing the call on the
 SolrJ side or by looking at the machine where Solr is running? This is the
 very first thing you have to answer. You can get a rough ides with any
 simple profiler (say Activity Monitor on a Mac, Task Manager on a Windows
 box). The point is just to see whether the indexer machine is being
 well utilized. I'd guess it's not actually.
 
 One quick experiment would be to try using StreamingUpdateSolrServer
 (SUSS), which has the capability of having multiple threads
 fire at Solr at once. It is possible that your performance is spent
 waiting for I/O.
 
 Once you have that question answered, you can refine. But until you
 know which side of the wire the problem is on, you're flying blind.
 
 Both Yandong Peyman:
 These times are quite surprising. Running everything locally on my laptop,
 I'm indexing between 5-7K documents/second. The source is
 the Wikipedia dump.
 
 I'm particularly surprised by the difference Yandong is seeing based
 on the various analysis chains. the first thing I'd back off is the
 MaxPermSize. 512M is huge for this parameter.
 If you're getting that kind of time differential and your CPU isn't
 pegged, you're probably swapping in which case you need
 to give the processes more memory. I'd just take the MaxPermSize
 out completely as a start.
 
 Not sure if you've seen this page, something there might help.
 http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
 
 But throw a profiler at the indexer as a first step, just to see
 where the problem is, CPU or I/O.
 
 Best
 Erick
 
 On Sat, Mar 10, 2012 at 4:09 PM, Peyman Faratin pey...@robustlinks.com 
 wrote:
 Hi
 
 I am trying to index 12MM docs faster than is currently happening in Solr 
 (using solrj). We have identified solr's add method as the bottleneck (and 
 not commit - which is tuned ok through mergeFactor and maxRamBufferSize and 
 jvm ram).
 
 Adding 1000 docs is taking approximately 25 seconds. We are making sure we 
 add and commit in batches. And we've tried both CommonsHttpSolrServer and 
 EmbeddedSolrServer (assuming removing http overhead would speed things up 
 with embedding) but the differences is marginal.
 
 The docs being indexed are on average 20 fields long, mostly indexed but 
 none stored. The major size contributors are two fields:
 
- content, and
- shingledContent (populated using copyField of content).
 
 The length of the content field is (likely) gaussian distributed (few large 
 docs 50-80K tokens, but majority around 2k tokens). We use shingledContent 
 to support phrase queries and content for unigram queries (following the 
 advice of Solr Enterprise search server advice - p. 305, section The 
 Solution: Shingling).
 
 Clearly the size of the docs is a contributor to the slow adds (confirmed by 
 removing these 2 fields resulting in halving the indexing time). We've tried 
 compressed=true also but that is not working.
 
 Any guidance on how to support our application logic (without having to 
 change the schema too much) and speed the indexing speed (from current 212 
 days for 12MM docs) would be much appreciated.
 
 thank you
 
 Peyman
 



Faster Solr Indexing

2012-03-10 Thread Peyman Faratin
Hi

I am trying to index 12MM docs faster than is currently happening in Solr 
(using solrj). We have identified solr's add method as the bottleneck (and not 
commit - which is tuned ok through mergeFactor and maxRamBufferSize and jvm 
ram). 

Adding 1000 docs is taking approximately 25 seconds. We are making sure we add 
and commit in batches. And we've tried both CommonsHttpSolrServer and 
EmbeddedSolrServer (assuming removing http overhead would speed things up with 
embedding) but the differences is marginal. 

The docs being indexed are on average 20 fields long, mostly indexed but none 
stored. The major size contributors are two fields:

- content, and
- shingledContent (populated using copyField of content).

The length of the content field is (likely) gaussian distributed (few large 
docs 50-80K tokens, but majority around 2k tokens). We use shingledContent to 
support phrase queries and content for unigram queries (following the advice of 
Solr Enterprise search server advice - p. 305, section The Solution: 
Shingling). 

Clearly the size of the docs is a contributor to the slow adds (confirmed by 
removing these 2 fields resulting in halving the indexing time). We've tried 
compressed=true also but that is not working. 

Any guidance on how to support our application logic (without having to change 
the schema too much) and speed the indexing speed (from current 212 days for 
12MM docs) would be much appreciated. 

thank you

Peyman