Re: PageRank sort

2009-04-24 Thread Grant Ingersoll

How often are you updating the rank?

You might also be able to keep the rank info in a flat file via the  
ExternalFileField and the FileFloatSource and do FunctionQuery stuff  
that way.   However, I don't know how that handles refreshing data or  
if it would be efficient in your case.


On Apr 24, 2009, at 1:52 AM, Marcus Herou wrote:


Hi.

I've posted before but here it goes again:

I have BlogData data which is more or less 100% static, but one field
is not: the PageRank.
I would like to sort on that field, and on the Lucene list I got these
answers:

1. Use two indexes and a ParallelReader.
2. Use a FieldScoreQuery containing the PageRank field.
3. Use a CustomScoreQuery which uses the FieldScoreQuery combined
with other queries (the actual search).

I think I could use this pattern as well:
1. Use two indexes and a ParallelReader.
2. Do a normal search and sort on the PageRank column (perhaps
consuming more memory).

Does anyone have an idea of how to implement these patterns in Solr?
I have never extended Solr but am not afraid of doing so if someone
pushes me in the right direction.

Kindly

//Marcus




--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: PageRank sort

2009-04-24 Thread Marcus Herou
Hi.

Comments inline.

On Fri, Apr 24, 2009 at 1:00 PM, Grant Ingersoll gsing...@apache.org wrote:

> How often are you updating the rank?


The goal is to optimize the PageRank calculation so that we can have
continuous updates (one blog at a time, 24/7), but more likely we'll end up
refreshing the index once a week or so (hopefully every night).




> You might also be able to keep the rank info in a flat file via the
> ExternalFileField and the FileFloatSource and do FunctionQuery stuff that
> way. However, I don't know how that handles refreshing data or if it would
> be efficient in your case.


Great! That seems like something that could work. It depends on how that field
gets re-read/indexed, I guess. Or is it used solely at query time? Googling
ExternalFileField does not really turn up the detail I need to narrow this
down. Any pointers and/or pseudocode?





This seems to be a general limitation in Lucene: it is built around a very
fast internal scoring mechanism rather than for plugging in an external
one. Hopefully I'll sort this one out anyway.

Thanks for the reply, really appreciated.

Kindly

//Marcus





Re: PageRank sort

2009-04-24 Thread Yonik Seeley
On Fri, Apr 24, 2009 at 1:39 PM, Marcus Herou
marcus.he...@tailsweep.com wrote:
> Great! That seems like something that could work. Depends on how that field
> gets re-read/indexed I guess.

http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html

It's a separate *text* file that just contains id/value pairs for a field.
Calculate your custom score and save it to that file.  Then call
commit so the file is re-read and all your scores will be updated (and
usable in a function query).

So the short answer is, you should be able to do what you want with no
Solr customization in Java... it's all built in.

-Yonik
http://www.lucidimagination.com
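
A minimal Python sketch of the workflow described above: write the id/value
pairs to the external file, then issue a commit so the file is re-read. It
assumes the usual ExternalFileField conventions (a file named
external_<fieldname> in the index data directory, one key=value pair per
line). The function names, the data directory, and the update URL in the
comments are illustrative assumptions, not something given in the thread.

```python
import os
import tempfile
import urllib.request


def write_external_file(data_dir, field_name, scores):
    """Write key=value lines to <data_dir>/external_<field_name>.

    The data is written to a temporary file first and then renamed into
    place, so a reader never sees a half-written file.
    """
    target = os.path.join(data_dir, "external_" + field_name)
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".tmp_rank_")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for doc_key, score in sorted(scores.items()):
            out.write(f"{doc_key}={score}\n")
    # mkstemp creates the file as owner-only; make it readable by the
    # account that runs the search server.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, target)
    return target


def send_commit(update_url):
    """POST <commit/> to the update handler and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example usage (hypothetical paths and URL, adjust to your installation):
#   write_external_file("/srv/solr/test/data", "blogRank", {1: 2.0, 2: 1.0})
#   send_commit("http://127.0.0.1:8110/solr/test/update")
```

The keys in the file are values of the field configured as keyField for the
external field type, so they must match what is indexed in that field.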


Re: PageRank sort

2009-04-24 Thread Marcus Herou
And I published the setup here:
http://dev.tailsweep.com/solr-external-scoring/en/

/M



Re: PageRank sort

2009-04-24 Thread Marcus Herou
Works like a charm!

Thank you sir.

//Marcus



Re: PageRank sort

2009-04-24 Thread Marcus Herou
Cool!

GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'
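
A small Python sketch of building such a request so that the parameter
separators and the {!boost ...} local-params syntax are encoded correctly.
The helper name boosted_query_url is hypothetical; the host, core, field
names, and parameters are the ones used in this thread.

```python
from urllib.parse import urlencode


def boosted_query_url(base_url, user_query, boost_field, rows=100, debug=False):
    """Build a select URL that multiplies the relevancy score by boost_field."""
    params = [
        ("indent", "on"),
        ("start", "0"),
        ("rows", str(rows)),
        # The boost parser wraps the real query, which is passed as $qq.
        ("q", "{!boost b=%s v=$qq}" % boost_field),
        ("qq", user_query),
    ]
    if debug:
        params.append(("debugQuery", "on"))
    # urlencode percent-encodes braces, '!', '$' and spaces for us.
    return base_url + "?" + urlencode(params)


url = boosted_query_url(
    "http://127.0.0.1:8110/solr/test/select", "title:solr", "blogRank", debug=True
)
print(url)
```

Building the query string from a parameter list avoids hand-escaping the
special characters in the q parameter.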

On Sat, Apr 25, 2009 at 12:43 AM, Marcus Herou
marcus.he...@tailsweep.com wrote:

 That seems wise... PageRank * Text-based Scoring.

 So you mean that in my stupid case:
 GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=*:*'
 would yield the same results as:
 GET "http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q=*:*_val_:\"log(blogRank)\""

 since I have no text data,

 but what if I introduce a tokenized text field (title)?
 Example:

 <add>
 <doc><field name="blogId">1</field><field name="title">solr solr solr</field></doc>
 <doc><field name="blogId">2</field><field name="title">solr</field></doc>
 </add>

 where blogId=1 has a blogRank of 1
 and blogId=2 has a blogRank of 2,
 and I search for solr:
 GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr'

 I might get blogId=1 as number one in the results even though it has the
 lower blogRank, due to the higher frequency of the term solr?

 Did I understand this correctly?

 //Marcus
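
 To make the interplay concrete, here is a small Python sketch that
 recomputes the boosted scores from the figures in the debugQuery output
 posted in this thread. It assumes the classic Lucene TF-IDF field weight
 (tf = sqrt(termFreq), multiplied by idf and fieldNorm), which is what the
 explain output reports; the function name is made up for illustration.

```python
import math


def boosted_score(term_freq, idf, field_norm, rank):
    """Field weight (tf * idf * fieldNorm) multiplied by the external rank."""
    tf = math.sqrt(term_freq)  # matches tf(termFreq=4) = 2.0 in the explain
    field_weight = tf * idf * field_norm
    return field_weight * rank


IDF = 1.9162908  # idf(docFreq=5, numDocs=5) as reported by debugQuery

# (pkId, termFreq, fieldNorm, blogRank) copied from the explain section
docs = [
    ("4", 4, 0.5, 3.0),
    ("1", 1, 1.0, 2.0),
    ("2", 2, 0.625, 2.0),
    ("5", 5, 0.4375, 1.0),
    ("3", 3, 0.5, 1.0),
]

scores = {pk: boosted_score(tf, IDF, norm, rank) for pk, tf, norm, rank in docs}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

 In that output, pkId 5 has the highest term frequency (5) but a rank of 1.0
 and a small fieldNorm, so it ends up below pkId 1, which matches the term
 only once but has a rank of 2.0. With a different mix of values the text
 score could outweigh the rank, since the two are simply multiplied.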



 On Sat, Apr 25, 2009 at 12:07 AM, Yonik Seeley yo...@lucidimagination.com
  wrote:

 You probably want to mix the custom score with the normal relevancy
 score... to add, use a normal boolean query.  To multiply, check out
 boosted query:

 http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html

 For other options, use a more complex function query with the new
 query() capability (need to use 1.4 trunk for that though).

 Examples:

 q={!boost b=myScore v=$qq}&qq=my normal lucene query
  OR for a dismax relevancy query,
 q={!boost b=myScore v=$qq}&qq={!dismax qf=text_all pf=text_all}solr rocks

 If the {! type of syntax looks new, check out
 http://wiki.apache.org/solr/LocalParams
 powerful stuff!

 -Yonik
 http://www.lucidimagination.com






Re: PageRank sort

2009-04-24 Thread Marcus Herou
Meant this part:
m...@mahe-laptop:~$ GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">121</int>
 <lst name="params">
  <str name="qq">title:solr</str>
  <str name="start">0</str>
  <str name="indent">on</str>
  <str name="q">{!boost b=blogRank v=$qq}</str>
  <str name="debugQuery">on</str>
  <str name="rows">100</str>
 </lst>
</lst>
<result name="response" numFound="5" start="0">
 <doc>
  <int name="blogId">3</int>
  <str name="pkId">4</str>
 </doc>
 <doc>
  <int name="blogId">1</int>
  <str name="pkId">1</str>
 </doc>
 <doc>
  <int name="blogId">1</int>
  <str name="pkId">2</str>
 </doc>
 <doc>
  <int name="blogId">4</int>
  <str name="pkId">5</str>
 </doc>
 <doc>
  <int name="blogId">2</int>
  <str name="pkId">3</str>
 </doc>
</result>
<lst name="debug">
 <str name="rawquerystring">{!boost b=blogRank v=$qq}</str>
 <str name="querystring">{!boost b=blogRank v=$qq}</str>
 <str name="parsedquery">BoostedQuery(boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)))</str>
 <str name="parsedquery_toString">boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/))</str>
 <lst name="explain">
  <str name="4">
5.7488723 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.9162908 = (MATCH) fieldWeight(title:solr in 13), product of:
    2.0 = tf(termFreq(title:solr)=4)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.5 = fieldNorm(field=title, doc=13)
  3.0 = float(blogRank{type=blogRankFile,properties=})=3.0
</str>
  <str name="1">
3.8325815 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.9162908 = (MATCH) fieldWeight(title:solr in 10), product of:
    1.0 = tf(termFreq(title:solr)=1)
    1.9162908 = idf(docFreq=5, numDocs=5)
    1.0 = fieldNorm(field=title, doc=10)
  2.0 = float(blogRank{type=blogRankFile,properties=})=2.0
</str>
  <str name="2">
3.3875556 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.6937778 = (MATCH) fieldWeight(title:solr in 11), product of:
    1.4142135 = tf(termFreq(title:solr)=2)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.625 = fieldNorm(field=title, doc=11)
  2.0 = float(blogRank{type=blogRankFile,properties=})=2.0
</str>
  <str name="5">
1.8746685 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.8746685 = (MATCH) fieldWeight(title:solr in 14), product of:
    2.236068 = tf(termFreq(title:solr)=5)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.4375 = fieldNorm(field=title, doc=14)
  1.0 = float(blogRank{type=blogRankFile,properties=})=1.0
</str>
  <str name="3">
1.6595565 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.6595565 = (MATCH) fieldWeight(title:solr in 12), product of:
    1.7320508 = tf(termFreq(title:solr)=3)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.5 = fieldNorm(field=title, doc=12)
  1.0 = float(blogRank{type=blogRankFile,properties=})=1.0
</str>
 </lst>
 <str name="QParser">LuceneQParser</str>
 <str name="boost_str">blogRank</str>
 <str name="boost_parsed">org.apache.solr.search.function.FileFloatSource:FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)</str>
 <lst name="timing">
  <double name="time">106.0</double>
  <lst name="prepare">
   <double name="time">55.0</double>
   <lst name="org.apache.solr.handler.component.QueryComponent">
    <double name="time">53.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.FacetComponent">
    <double name="time">0.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
    <double name="time">0.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.HighlightComponent">
    <double name="time">0.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.StatsComponent">
    <double name="time">0.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.DebugComponent">
    <double name="time">0.0</double>
   </lst>
  </lst>
  <lst name="process">
   <double name="time">50.0</double>
   <lst name="org.apache.solr.handler.component.QueryComponent">
    <double name="time">42.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.FacetComponent">
    <double name="time">0.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
    <double name="time">0.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.HighlightComponent">
    <double name="time">0.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.StatsComponent">
    <double name="time">0.0</double>
   </lst>
   <lst name="org.apache.solr.handler.component.DebugComponent">

PageRank sort

2009-04-23 Thread Marcus Herou
Hi.

I've posted before but here it goes again:

I have BlogData data which is more or less 100% static, but one field is not:
the PageRank.
I would like to sort on that field, and on the Lucene list I got these
answers:

1. Use two indexes and a ParallelReader.
2. Use a FieldScoreQuery containing the PageRank field.
3. Use a CustomScoreQuery which uses the FieldScoreQuery combined with other
queries (the actual search).

I think I could use this pattern as well:
1. Use two indexes and a ParallelReader.
2. Do a normal search and sort on the PageRank column (perhaps consuming more
memory).

Does anyone have an idea of how to implement these patterns in Solr?
I have never extended Solr but am not afraid of doing so if someone pushes
me in the right direction.

Kindly

//Marcus



