Re: PageRank sort
How often are you updating the rank?

You might also be able to keep the rank info in a flat file via the ExternalFileField and the FileFloatSource and do FunctionQuery stuff that way. However, I don't know how that handles refreshing data or if it would be efficient in your case.

On Apr 24, 2009, at 1:52 AM, Marcus Herou wrote:

> Hi.
>
> I've posted before but here it goes again:
> I have BlogData data which is more or less 100% static, but one field is not: the PageRank. I would like to sort on that field, and on the Lucene list I got these answers.
>
> 1. Use two indexes and a ParallelReader
> 2. Use a FieldScoreQuery containing the PageRank field.
> 3. Use a CustomScoreQuery which uses the FieldScoreQuery combined with other Queries (the actual search).
>
> I think I could use this pattern as well:
>
> 1. Use two indexes and a ParallelReader
> 2. Normal search and Sort on the PageRank column (perhaps consuming more memory)
>
> Does anyone have an idea of how to implement these patterns in Solr? I have never extended Solr but am not afraid of doing so if someone pushes me in the right direction.
>
> Kindly
> //Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
Re: PageRank sort
Hi. Comments inline.

On Fri, Apr 24, 2009 at 1:00 PM, Grant Ingersoll gsing...@apache.org wrote:

> How often are you updating the rank?

The goal is to optimize the PageRank calculation algorithm so that we can have continuous updates (one blog at a time, 24/7), but more likely we'll end up refreshing the index once a week or so (hopefully each night).

> You might also be able to keep the rank info in a flat file via the ExternalFileField and the FileFloatSource and do FunctionQuery stuff that way. However, I don't know how that handles refreshing data or if it would be efficient in your case.

Great! That seems like something that could work. It depends on how that field gets re-read/indexed, I guess. Or is it used solely at query time? Googling ExternalFileField does not really give me the meat I need to narrow this down. Any pointers and/or pseudo code?

It seems to be a generic issue with Lucene, since it is not really built so that one can plug in an external scoring mechanism (it has a very fast internal one instead), but hopefully I'll sort this one out.

Thanks for the reply, really appreciated.

Kindly
//Marcus
Re: PageRank sort
On Fri, Apr 24, 2009 at 1:39 PM, Marcus Herou marcus.he...@tailsweep.com wrote:

> Great! That seems like something that could work. Depends on how that field gets re-read/indexed I guess.

http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html

It's a separate *text* file that just contains id/value pairs for a field. Calculate your custom score and save it to that file. Then call commit so the file is re-read and all your scores will be updated (and usable in a function query).

So the short answer is, you should be able to do what you want with no Solr customization in Java... it's all built in.

-Yonik
http://www.lucidimagination.com
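A minimal sketch of the "save it to that file" step described above, in Python. The helper name `write_external_file` is hypothetical. The `external_<fieldname>` file name and the `key=value` line format are my understanding of the ExternalFileField documentation, so treat both as assumptions and check them against the Javadoc linked above. The sample ranks are the blogId/blogRank pairs that show up in the debug output later in this thread.

```python
import os
import tempfile


def write_external_file(data_dir, field_name, ranks):
    """Write key=value pairs for an external field file.

    Assumption: Solr reads <data_dir>/external_<field_name>, one
    "key=value" pair per line. The file is written to a temporary
    name first and then renamed, so a reader never sees a
    half-written file.
    """
    target = os.path.join(data_dir, "external_" + field_name)
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".tmp_", text=True)
    with os.fdopen(fd, "w") as out:
        # Keys are sorted only to make the output deterministic.
        for key in sorted(ranks):
            out.write("%s=%s\n" % (key, ranks[key]))
    os.replace(tmp_path, target)
    return target


if __name__ == "__main__":
    # blogId -> blogRank, values taken from the thread's debug output.
    sample = {1: 2.0, 2: 1.0, 3: 3.0, 4: 1.0}
    with tempfile.TemporaryDirectory() as data_dir:
        path = write_external_file(data_dir, "blogRank", sample)
        with open(path) as handle:
            print(handle.read(), end="")
```

After the file is in place, a commit is what triggers the re-read, as described above. How the commit is sent (for example from the indexing job) is left out of the sketch on purpose.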
Re: PageRank sort
And I published the setup here: http://dev.tailsweep.com/solr-external-scoring/en/

/M

On Sat, Apr 25, 2009 at 12:01 AM, Marcus Herou marcus.he...@tailsweep.com wrote:

> Works like a charm!
>
> Thank you sir.
>
> //Marcus
Re: PageRank sort
Works like a charm!

Thank you sir.

//Marcus

On Fri, Apr 24, 2009 at 11:01 PM, Marcus Herou marcus.he...@tailsweep.com wrote:

> That is fantastic, I am creating a really small index right now trying to figure out how to implement the FunctionQuery for this.
>
> //Marcus
>
> On Fri, Apr 24, 2009 at 10:55 PM, Yonik Seeley yo...@lucidimagination.com wrote:
>
>> On Fri, Apr 24, 2009 at 1:39 PM, Marcus Herou marcus.he...@tailsweep.com wrote:
>>
>>> Great! That seems like something that could work. Depends on how that field gets re-read/indexed I guess.
>>
>> http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
>>
>> It's a separate *text* file that just contains id/value pairs for a field. Calculate your custom score and save it to that file. Then call commit so the file is re-read and all your scores will be updated (and usable in a function query).
>>
>> So the short answer is, you should be able to do what you want with no Solr customization in Java... it's all built in.
>>
>> -Yonik
>> http://www.lucidimagination.com
Re: PageRank sort
Cool!

GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'

On Sat, Apr 25, 2009 at 12:43 AM, Marcus Herou marcus.he...@tailsweep.com wrote:

> That seems wise... PageRank * text-based scoring.
>
> So you mean that, in my stupid case,
>
> GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=*:*'
>
> would yield the same results as
>
> GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q=*:* _val_:"log(blogRank)"'
>
> since I have no text data. But suppose I introduce a tokenized text field (title). Example:
>
> <add>
>   <doc><field name="blogId">1</field><field name="title">solr solr solr</field></doc>
>   <doc><field name="blogId">2</field><field name="title">solr</field></doc>
> </add>
>
> where blogId=1 had a blogRank of 1
> and blogId=2 had a blogRank of 2,
>
> and I searched for solr:
>
> GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr'
>
> Then I might get blogId=1 as number one in the results, even though it had the lower blogRank, because of the higher frequency of the term solr?
>
> Did I understand this correctly?
>
> //Marcus
>
> On Sat, Apr 25, 2009 at 12:07 AM, Yonik Seeley yo...@lucidimagination.com wrote:
>
>> You probably want to mix the custom score with the normal relevancy score... to add, use a normal boolean query. To multiply, check out boosted query:
>> http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html
>>
>> For other options, use a more complex function query with the new query() capability (need to use 1.4 trunk for that though).
>>
>> Examples:
>>
>> q={!boost b=myScore v=$qq}&qq=my normal lucene query
>>
>> OR for a dismax relevancy query,
>>
>> q={!boost b=myScore v=$qq}&qq={!dismax qf=text_all pf=text_all}solr rocks
>>
>> If the {! type of syntax looks new, check out
>> http://wiki.apache.org/solr/LocalParams
>> powerful stuff!
>>
>> -Yonik
>> http://www.lucidimagination.com
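The boost requests in this message are easier to get right when the parameters are URL-encoded by a library instead of pasted by hand, because `{`, `!`, `$` and spaces all need escaping. The following Python sketch builds the same kind of request. The function name `boosted_select_url` is hypothetical; the host, port, core name, and field names are the ones used in the thread.

```python
from urllib.parse import urlencode


def boosted_select_url(base_url, user_query, boost_field, rows=100, debug=False):
    """Build a select URL that multiplies the relevancy score of
    user_query by the value of boost_field, using the
    {!boost b=... v=$qq} local-params form quoted in this thread."""
    params = [
        ("indent", "on"),
        ("start", "0"),
        ("rows", str(rows)),
        # $qq is dereferenced by Solr to the separate qq parameter.
        ("q", "{!boost b=%s v=$qq}" % boost_field),
        ("qq", user_query),
    ]
    if debug:
        params.append(("debugQuery", "on"))
    # urlencode percent-escapes the braces, "!", "$", "=" and spaces.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(boosted_select_url(
        "http://127.0.0.1:8110/solr/test/select",
        "title:solr",
        "blogRank",
        debug=True,
    ))
```

Keeping the user's text query in `qq` and the boost wrapper in `q` means the user input never has to be spliced into the local-params string.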
Re: PageRank sort
Meant this part:

m...@mahe-laptop:~$ GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">121</int>
  <lst name="params">
    <str name="qq">title:solr</str>
    <str name="start">0</str>
    <str name="indent">on</str>
    <str name="q">{!boost b=blogRank v=$qq}</str>
    <str name="debugQuery">on</str>
    <str name="rows">100</str>
  </lst>
</lst>
<result name="response" numFound="5" start="0">
  <doc>
    <int name="blogId">3</int>
    <str name="pkId">4</str>
  </doc>
  <doc>
    <int name="blogId">1</int>
    <str name="pkId">1</str>
  </doc>
  <doc>
    <int name="blogId">1</int>
    <str name="pkId">2</str>
  </doc>
  <doc>
    <int name="blogId">4</int>
    <str name="pkId">5</str>
  </doc>
  <doc>
    <int name="blogId">2</int>
    <str name="pkId">3</str>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">{!boost b=blogRank v=$qq}</str>
  <str name="querystring">{!boost b=blogRank v=$qq}</str>
  <str name="parsedquery">BoostedQuery(boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)))</str>
  <str name="parsedquery_toString">boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/))</str>
  <lst name="explain">
    <str name="4">
5.7488723 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.9162908 = (MATCH) fieldWeight(title:solr in 13), product of:
    2.0 = tf(termFreq(title:solr)=4)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.5 = fieldNorm(field=title, doc=13)
  3.0 = float(blogRank{type=blogRankFile,properties=})=3.0
</str>
    <str name="1">
3.8325815 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.9162908 = (MATCH) fieldWeight(title:solr in 10), product of:
    1.0 = tf(termFreq(title:solr)=1)
    1.9162908 = idf(docFreq=5, numDocs=5)
    1.0 = fieldNorm(field=title, doc=10)
  2.0 = float(blogRank{type=blogRankFile,properties=})=2.0
</str>
    <str name="2">
3.3875556 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.6937778 = (MATCH) fieldWeight(title:solr in 11), product of:
    1.4142135 = tf(termFreq(title:solr)=2)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.625 = fieldNorm(field=title, doc=11)
  2.0 = float(blogRank{type=blogRankFile,properties=})=2.0
</str>
    <str name="5">
1.8746685 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.8746685 = (MATCH) fieldWeight(title:solr in 14), product of:
    2.236068 = tf(termFreq(title:solr)=5)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.4375 = fieldNorm(field=title, doc=14)
  1.0 = float(blogRank{type=blogRankFile,properties=})=1.0
</str>
    <str name="3">
1.6595565 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.6595565 = (MATCH) fieldWeight(title:solr in 12), product of:
    1.7320508 = tf(termFreq(title:solr)=3)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.5 = fieldNorm(field=title, doc=12)
  1.0 = float(blogRank{type=blogRankFile,properties=})=1.0
</str>
  </lst>
  <str name="QParser">LuceneQParser</str>
  <str name="boost_str">blogRank</str>
  <str name="boost_parsed">org.apache.solr.search.function.FileFloatSource:FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)</str>
  <lst name="timing">
    <double name="time">106.0</double>
    <lst name="prepare">
      <double name="time">55.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">53.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.StatsComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.DebugComponent">
        <double name="time">0.0</double>
      </lst>
    </lst>
    <lst name="process">
      <double name="time">50.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">42.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.StatsComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.DebugComponent">
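The explain section above can be checked by hand: each final score is the text score (tf × idf × fieldNorm) multiplied by the blogRank read from the external file, and tf in this output equals the square root of termFreq. The Python sketch below redoes that arithmetic with the numbers copied from the output. The idf value is taken as printed and not recomputed, and the function name `boosted_score` is only illustrative.

```python
import math

# idf as printed in the explain output; not recomputed here.
IDF = 1.9162908

# (pkId, termFreq, fieldNorm, blogRank), copied from the explain output.
DOCS = [
    ("4", 4, 0.5, 3.0),
    ("1", 1, 1.0, 2.0),
    ("2", 2, 0.625, 2.0),
    ("5", 5, 0.4375, 1.0),
    ("3", 3, 0.5, 1.0),
]


def boosted_score(term_freq, field_norm, blog_rank, idf=IDF):
    """Text score (tf * idf * fieldNorm) multiplied by the external rank."""
    text_score = math.sqrt(term_freq) * idf * field_norm
    return text_score * blog_rank


# Highest boosted score first, which is the order Solr returned the docs in.
RANKED = sorted(DOCS, key=lambda doc: boosted_score(*doc[1:]), reverse=True)

if __name__ == "__main__":
    for pk_id, term_freq, field_norm, blog_rank in RANKED:
        score = boosted_score(term_freq, field_norm, blog_rank)
        print("pkId=%s score=%.4f" % (pk_id, score))
```

The document with pkId 5 has the highest term frequency (5) but a blogRank of 1.0, so it ends up below the documents with rank 2.0 and 3.0. The reverse can also happen: a large enough text score can lift a lower-ranked document above a higher-ranked one, because the two factors are simply multiplied.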
PageRank sort
Hi.

I've posted before but here it goes again:
I have BlogData data which is more or less 100% static, but one field is not: the PageRank. I would like to sort on that field, and on the Lucene list I got these answers.

1. Use two indexes and a ParallelReader
2. Use a FieldScoreQuery containing the PageRank field.
3. Use a CustomScoreQuery which uses the FieldScoreQuery combined with other Queries (the actual search).

I think I could use this pattern as well:

1. Use two indexes and a ParallelReader
2. Normal search and Sort on the PageRank column (perhaps consuming more memory)

Does anyone have an idea of how to implement these patterns in Solr? I have never extended Solr but am not afraid of doing so if someone pushes me in the right direction.

Kindly
//Marcus

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/