[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378755 ]

Andrzej Bialecki commented on NUTCH-267:
----------------------------------------

I would argue that what Nutch implements now shouldn't be called OPIC, because 
it has little to do with the algorithm described in the OPIC paper. Either we 
fix it, or we should rename it. Let me explain:

* the paper uses a cash flow concept, where nodes not only receive score 
contributions but also give them away, thus _reducing_ their available score. 
This is not implemented in Nutch, which leads to scores growing without bound. 
It also makes the score dependent on the number of fetch cycles, i.e. the 
scores of two pages with exactly the same inlinks will differ if one of them 
has undergone more refresh cycles than the other. So the fundamental premise 
of the algorithm - that scores converge to stable values as a result of cash 
flow balance - is not retained.

* the paper uses a concept of virtual nodes that give away cash to 
disconnected nodes in the current graph. In reality, these nodes are probably 
connected, but the current graph is not complete enough to track it. The Nutch 
implementation doesn't use this, but only because it doesn't give away cash.

* finally, the paper argues that the OPIC score and other scores should be 
combined as a sum of logarithms, i.e. log(opic) + log(docSimilarity). Nutch 
instead uses the formula sqrt(opic) * docSimilarity (through document 
boosting); see the sketch below.
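To make the difference concrete, here is a minimal sketch (hypothetical 
helper methods, not Nutch code) of the two combinations. Note that 
log(opic) + log(docSimilarity) = log(opic * docSimilarity), so the log-sum 
ranks documents exactly like the plain product; the substantive difference 
from the current code is the sqrt() dampening of the OPIC score.

// Hypothetical illustration of the two combinations (not Nutch code).

// Combination suggested by the OPIC paper: sum of logarithms.
static double paperScore(double opic, double docSimilarity) {
  return Math.log(opic) + Math.log(docSimilarity); // == log(opic * docSimilarity)
}

// What Nutch effectively computes today via document boosting.
static double nutchScore(double opic, double docSimilarity) {
  return Math.sqrt(opic) * docSimilarity;
}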

I'm going to commit the scoring API soon; this should make it easier to 
experiment with different scoring models.

 Indexer doesn't consider linkdb when calculating boost value
 

  Key: NUTCH-267
  URL: http://issues.apache.org/jira/browse/NUTCH-267
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Chris Schneider
 Priority: Minor


 Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
 indexer.boost.by.link.count was true, the indexer boost value was scaled 
 based on the log of the number of inbound links:
 if (boostByLinkCount)
   res *= (float)Math.log(Math.E + linkCount);
 This is no longer true (even before Andrzej implemented scoring filters). 
 Instead, the boost value is just the square root (or some other scorePower) 
 of the page score. Shouldn't the invertlinks command, which creates the 
 linkdb, have some effect on the boost value calculated during indexing 
 (either via the OPICScoringFilter or some other built-in filter)?
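For illustration, a filter along these lines could reintroduce the old 
behavior. This is a speculative sketch: it assumes the 0.8-dev 
IndexingFilter signature filter(Document, Parse, UTF8, CrawlDatum, Inlinks) 
and Inlinks.size(), and the class name is hypothetical - it is not an 
existing Nutch plugin.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.UTF8;
import org.apache.lucene.document.Document;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

// Hypothetical plugin restoring log-of-link-count boosting at indexing time.
public class LinkCountScoringFilter implements IndexingFilter {
  private Configuration conf;

  public Document filter(Document doc, Parse parse, UTF8 url,
                         CrawlDatum datum, Inlinks inlinks) {
    int linkCount = (inlinks == null) ? 0 : inlinks.size();
    // Same formula the 0.7 indexer used when indexer.boost.by.link.count was set.
    doc.setBoost(doc.getBoost() * (float) Math.log(Math.E + linkCount));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}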




Creating different binary databases for indexing

2006-05-09 Thread Dennis Kubes
I am working on a boosting solution where I am having to create more 
binary databases than just the linkdb, crawldb, etc.  For example, I 
create one for uncommon words in a page.  Then I want to use these 
database objects inside the indexing process, in the filters, by key, 
along with the linkdb, parse text, parse data, and so on.

The link database and parse text and data are passed into the filters 
directly through the filter interface.  I can't pass other databases 
alongside because I would have to change the interface, which means I 
would have to refactor all existing indexing filters.  The easiest way I 
have found so far is modifying the parse interface to also hold the 
database objects that I need, but that doesn't feel like a good 
long-term solution.


Is there a better way to pass other keyed value (database) objects into 
the indexing filters?  Should we start a discussion about whether we need 
this functionality in Nutch and how best to implement it?  I would be 
happy to implement it, but I want some discussion and opinions first.


Dennis


PATCH - Fixes for 0.8 tutorial

2006-05-09 Thread Lukas Vlcek

Hi,

I reported some typos and incomplete information in the Nutch 0.8 tutorial
some time ago. It seems that all committers and volunteers are busy
with more important issues, so I took this opportunity and am now
proud to present my *first-small-humble-patch-ever*.

Please review the patch and let me know what I should do better next time.
Note that I checked out the release-0.7.2 branch (as I found that the
source file for the 0.8 tutorial is located there) and generated the SVN
patch after modification. Thus there is an absolute file path from my
computer in the patch header (I am not an SVN expert - any advice is
welcome).

I also added dedup and merge command examples to the tutorial.
Feel free to remove them if you don't think this fits the original
tutorial's intent.

Regards,
Lukas
Index: 
/home/lukas/workspace/nutch-release-0.7.2/src/site/src/documentation/content/xdocs/tutorial8.xml
===================================================================
--- 
/home/lukas/workspace/nutch-release-0.7.2/src/site/src/documentation/content/xdocs/tutorial8.xml
(revision 405528)
+++ 
/home/lukas/workspace/nutch-release-0.7.2/src/site/src/documentation/content/xdocs/tutorial8.xml
(working copy)
@@ -243,16 +243,19 @@
 <p>Before indexing we first invert all of the links, so that we may
 index incoming anchor text with the pages.</p>
 
-<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source>
+<source>bin/nutch invertlinks crawl/linkdb -dir crawl/segments</source>
 
 <p>To index the segments we use the <code>index</code> command, as follows:</p>
 
-<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source>
+<source>bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*</source>
+
+<p>Then, we need to delete duplicate pages. This is done with:</p>
 
-<!-- <p>Then, before we can search a set of segments, we need to delete -->
-<!-- duplicate pages.  This is done with:</p> -->
+<source>bin/nutch dedup crawl/indexes</source>
 
-<!-- <source>bin/nutch dedup indexes</source> -->
+<p>In the end we merge all individual indexes into one index:</p>
+
+<source>bin/nutch merge crawl/index crawl/indexes</source>
 
 <p>Now we're ready to search!</p>
 


Re: Creating different binary databases for indexing

2006-05-09 Thread Andrzej Bialecki

Dennis Kubes wrote:
I am working on a boosting solution where I am having to create more 
binary databases than just the linkdb, crawldb, etc.  For example, I 
create one for uncommon words in a page.  Then I want to use these 
database objects inside the indexing process, in the filters, by key, 
along with the linkdb, parse text, parse data, and so on.
The link database and parse text and data are passed into the filters 
directly through the filter interface.  I can't pass other databases 
alongside because I would have to change the interface, which means I 
would have to refactor all existing indexing filters.  The easiest way 
I have found so far is modifying the parse interface to also hold the 
database objects that I need, but that doesn't feel like a good 
long-term solution.


Is there a better way to pass other keyed value (database) objects 
into the indexing filters?  Should we start a discussion about whether 
we need this functionality in Nutch and how best to implement it?  I 
would be happy to implement it, but I want some discussion and 
opinions first.


I'm not sure I understood all your requirements, but anyway: you can 
pass arbitrary Writable objects to Indexer map() and reduce(), because 
they will be wrapped into ObjectWritable. In particular, you could pass 
some data retrieved from an input file (using SequenceFileInputFormat), 
if you previously stored your database values in such a file. Or you could 
put the primary key of the DB record inside CrawlDatum.metaData, and 
then retrieve the record data from the DB during reduce ...


All of the above you can accomplish without changing any of the 
interfaces, just by adding properly formatted data to the input, and 
then using an indexing plugin that can consume this particular type of 
input data.
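As a rough illustration of the ObjectWritable mechanism (a sketch modeled 
on how the 0.8-dev Indexer.reduce() unwraps its inputs; the UncommonWords 
type below is hypothetical):

import java.io.*;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

public class MixedInputReduceSketch {

  // Hypothetical per-URL record, e.g. the uncommon words found in a page.
  public static class UncommonWords implements Writable {
    public void write(DataOutput out) throws IOException { /* ... */ }
    public void readFields(DataInput in) throws IOException { /* ... */ }
  }

  // Every value arrives wrapped in an ObjectWritable and is dispatched by
  // type; a new keyed input only needs another branch plus an extra input
  // directory for the job - the IndexingFilter interface stays untouched.
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
      throws IOException {
    Inlinks inlinks = null;
    CrawlDatum datum = null;
    UncommonWords uncommon = null;
    while (values.hasNext()) {
      Object value = ((ObjectWritable) values.next()).get();  // unwrap
      if (value instanceof Inlinks) {
        inlinks = (Inlinks) value;
      } else if (value instanceof CrawlDatum) {
        datum = (CrawlDatum) value;
      } else if (value instanceof UncommonWords) {
        uncommon = (UncommonWords) value;
      }
      // ... ParseData and ParseText are handled the same way in Indexer
    }
    // build the Lucene Document from the gathered pieces, run the indexing
    // filters, and collect the result, as Indexer.reduce() does
  }
}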


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Creating different binary databases for indexing

2006-05-09 Thread Dennis Kubes
I am doing that, and I have changed Indexer to retrieve the 
ObjectWritable just as it does with the Inlinks and CrawlDb.  But my 
problem is that those objects are passed into the indexing filters 
directly (well, parse text and data are wrapped in Parse, but they still 
go in directly).  What if I want to pass another object to the 
filters?  What would be a good way to do that without changing the 
IndexingFilter interface?


Dennis

Andrzej Bialecki wrote:

Dennis Kubes wrote:
I am working on a boosting solution where I am having to create more 
binary databases than just the linkdb, crawldb, etc.  For example, I 
create one for uncommon words in a page.  Then I want to use these 
database objects inside the indexing process, in the filters, by key, 
along with the linkdb, parse text, parse data, and so on.
The link database and parse text and data are passed into the filters 
directly through the filter interface.  I can't pass other databases 
alongside because I would have to change the interface, which means I 
would have to refactor all existing indexing filters.  The easiest 
way I have found so far is modifying the parse interface to also hold 
the database objects that I need, but that doesn't feel like a good 
long-term solution.


Is there a better way to pass other keyed value (database) objects 
into the indexing filters?  Should we start a discussion about whether 
we need this functionality in Nutch and how best to implement it?  I 
would be happy to implement it, but I want some discussion and 
opinions first.


I'm not sure I understood all your requirements, but anyway: you can 
pass arbitrary Writable objects to Indexer map() and reduce(), because 
they will be wrapped into ObjectWritable. In particular, you could 
pass some data retrieved from an input file (using 
SequenceFileInputFormat), if you previously stored your database values 
in such a file. Or you could put the primary key of the DB record 
inside CrawlDatum.metaData, and then retrieve the record data from 
the DB during reduce ...


All of the above you can accomplish without changing any of the 
interfaces, just by adding properly formatted data to the input, and 
then using an indexing plugin that can consume this particular type of 
input data.




Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

2006-05-09 Thread Lukas Vlcek

Andrzej,

My pleasure. I would choose the following location:
http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
Let me know if you can think of anything better; otherwise I'll do it.

Regards,
Lukas

On 5/9/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Lukas Vlcek wrote:
 Andrzej,
 Thanks for your effort!

 Are you going to post tool descriptions somewhere on the wiki or in the
 tutorial? It would be great if this information could be available to
 people outside the dev mailing list as well.

If you have some spare cycles, would you be willing to do this? Take
excerpts from my email and from the Javadoc - I tried to make especially
the Javadoc as complete as possible...

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Creating different binary databases for indexing

2006-05-09 Thread Andrzej Bialecki

Dennis Kubes wrote:
I am doing that, and I have changed Indexer to retrieve the 
ObjectWritable just as it does with the Inlinks and CrawlDb.  But my 
problem is that those objects are passed into the indexing filters 
directly (well, parse text and data are wrapped in Parse, but they still 
go in directly).  What if I want to pass another object to the 
filters?  What would be a good way to do that without changing the 
IndexingFilter interface?


Use CrawlDatum.metaData. You can put any Writable there.
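Concretely, something like this (a sketch: it assumes 
CrawlDatum.getMetaData() exposes a Writable-to-Writable map, as in 
0.8-dev, and the key name is made up):

import org.apache.hadoop.io.UTF8;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical helpers showing the metaData round trip.
public class MetaDataSketch {
  private static final UTF8 DB_KEY = new UTF8("mydb.primary.key"); // made-up key

  // Attach the extra value wherever the datum is prepared (e.g. in a map()).
  static void attach(CrawlDatum datum, String primaryKey) {
    datum.getMetaData().put(DB_KEY, new UTF8(primaryKey));
  }

  // Read it back inside an indexing filter, which already receives the datum.
  static String retrieve(CrawlDatum datum) {
    UTF8 value = (UTF8) datum.getMetaData().get(DB_KEY);
    return (value == null) ? null : value.toString();
  }
}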

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ]

Doug Cutting commented on NUTCH-267:


Andrzej: your analysis is correct, but it mostly only applies when 
re-crawling. In an initial crawl, where each url is fetched only once, I 
think we implement the OPIC Greedy strategy. The question of what to do 
when re-crawling has not been adequately answered, but, glancing at the 
paper, it seems that resetting a url's score to zero each time it is 
fetched might be the best thing to do, so that it can start accumulating 
more cash.
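In code, the idea would look roughly like this (an illustration only, 
assuming CrawlDatum's getScore()/setScore() accessors; not existing Nutch 
code):

import org.apache.nutch.crawl.CrawlDatum;

// Illustration of the suggested reset: on fetch, the page's accumulated
// cash is handed out to its outlinks and the page starts over from zero.
public class OpicResetSketch {
  static float distributeOnFetch(CrawlDatum fetched, int numOutlinks) {
    float cashPerOutlink = (numOutlinks > 0)
        ? fetched.getScore() / numOutlinks
        : 0.0f;                // with no outlinks, cash would go to a virtual node
    fetched.setScore(0.0f);    // reset, so the page can accumulate fresh cash
    return cashPerOutlink;     // added to each outlink's target during updatedb
  }
}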

When ranking, summing logs is the same as multiplying, no?
