Re: Nutch nightly build and NUTCH-505 draft patch

2007-07-11 Thread Doğacan Güney

Hi,

On 7/2/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:

Recently I successfully applied NUTCH-505_draft_v2.patch as follows:

$ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
$ cd nutch
$ wget --no-check-certificate \
  https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch
$ sudo patch -p0 < NUTCH-505_draft_v2.patch
$ ant clean
$ ant
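
(A useful precaution at the patch step: GNU patch supports a dry run, which
reports whether the hunks apply cleanly without modifying any files.)

$ patch -p0 --dry-run < NUTCH-505_draft_v2.patch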

However, I also needed other recent nutch functionality, so I downloaded a 
nightly build:

$ wget 
http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz

I then attempted to apply the patch to that build using the same steps as 
above. I was able to run ant clean, but ant failed with

build.xml:61: Specify at least one source--a file or resource collection

Do I need to get a source checkout of a nightly build?  How would I do that?



Once you check out nutch trunk with svn checkout, you can use svn
up to get the latest code changes. You can also use svn st -u, which
compares your local version against trunk and shows you what has changed.
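
For example, from inside the checkout created above, something like:

$ svn st -u   # preview which files changed in trunk
$ svn up      # update the working copy to the latest trunk
$ ant clean && ant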








--
Doğacan Güney


[jira] Resolved: (NUTCH-505) Outlink urls should be validated

2007-07-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney resolved NUTCH-505.
-

   Resolution: Fixed
Fix Version/s: 1.0.0
 Assignee: Doğacan Güney

Committed in rev. 555237.

 Outlink urls should be validated
 

 Key: NUTCH-505
 URL: https://issues.apache.org/jira/browse/NUTCH-505
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, 
 NUTCH-505_draft_v2.patch


 See discussion here:
 http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
 Parse plugins may extract garbage urls from pages. We need a url validation 
 system that tests these urls and filters out garbage.
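
A minimal sketch of the kind of validation meant here (illustrative only,
not the committed NUTCH-505 code; java.net.URL is just one way to reject
malformed urls):

import java.net.MalformedURLException;
import java.net.URL;

public class OutlinkValidatorSketch {
  // Reject outlink urls that are not well-formed absolute urls.
  public static boolean isValid(String url) {
    try {
      URL u = new URL(url); // throws on malformed or unknown-protocol urls
      return u.getHost() != null && u.getHost().length() > 0;
    } catch (MalformedURLException e) {
      return false; // garbage extracted by a parse plugin
    }
  }
}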

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-506) Nutch should delegate compression to Hadoop

2007-07-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-506:


Attachment: NUTCH-506.patch

New version. I missed ProtocolStatus and ParseStatus. This patch updates them 
in a backward-compatible way.

 Nutch should delegate compression to Hadoop
 ---

 Key: NUTCH-506
 URL: https://issues.apache.org/jira/browse/NUTCH-506
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
 Fix For: 1.0.0

 Attachments: compress.patch, NUTCH-506.patch


 Some data structures within nutch (such as Content, ParseText) handle their 
 own compression. We should delegate all compression to Hadoop. 
 Also, nutch should respect the io.seqfile.compression.type setting. Currently, 
 even if io.seqfile.compression.type is BLOCK or RECORD, nutch overrides it 
 for some structures and sets it to NONE (however, IMO, ParseText should 
 always be compressed as RECORD for performance reasons).
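
For reference, a hedged sketch of what delegating compression to Hadoop
looks like (class and path names here are illustrative, not Nutch code;
the SequenceFile.createWriter overload is from the Hadoop API of this era):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class ParseTextWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hadoop handles the compression; RECORD is forced here for ParseText,
    // per the note above. Other structures would instead honor the
    // io.seqfile.compression.type value from the Configuration.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("parse_text/part-00000"),
        Text.class, Text.class, CompressionType.RECORD);
    writer.append(new Text("http://example.com/"), new Text("page text ..."));
    writer.close();
  }
}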

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: OPIC scoring differences

2007-07-11 Thread Doğacan Güney

On 7/9/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Carl Cerecke wrote:
 Hi,

 The docs for the OPICScoringFilter mention that the plugin implements a
 variant of OPIC from Abiteboul et al.'s paper. What exactly is different?
 How does the difference affect the scores?

As it is now, the implementation doesn't preserve the total cash value
in the system, and also there is almost no smoothing between the
iterations (Abiteboul's history).

As a consequence, scores may (and do) vary dramatically between
iterations, and they don't converge to stable values, i.e. they always
increase. For pages that get a lot of score contributions from other
pages this leads to an explosive increase into the range of thousands or
eventually millions. This means that the scores produced by the OPIC
plugin exaggerate score differences between pages more and more, even if
the web graph that you crawl is stable.

In a sense, to follow the cash analogy, our implementation of OPIC
illustrates a runaway economy - galloping inflation, rich get richer and
poor get poorer ;)
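
A toy simulation shows the effect (a sketch, not the plugin code): two pages
linking to each other, where each iteration a page adds the incoming cash to
its score but never gives up its own, so the total grows without bound.

public class OpicRunawaySketch {
  public static void main(String[] args) {
    double a = 1.0, b = 1.0; // two pages linking only to each other
    for (int i = 1; i <= 10; i++) {
      double newA = a + b;   // A adds B's contribution to its score...
      double newB = b + a;   // ...B adds A's, and neither score is reset
      a = newA;
      b = newB;
      System.out.println("iteration " + i + ": total score = " + (a + b));
    }
    // The total doubles every iteration: scores explode instead of converging.
  }
}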


 Also, there's a comment in the code:

 // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
 // XXX in the paper, where page loses its score if it's distributed to
 // XXX linked pages...

 Is this something that will be looked at eventually, or is the scoring
 good enough at the moment without some adjustment?

Yes, I'll start working on it when I get back from vacations. I did some
simulations that show how to fix it (see
http://wiki.apache.org/nutch/FixingOpicScoring bottom of the page).


Andrzej, nice to see you working on this.

There is one thing that I don't understand about your presentation.
Assume that page A is the only url in our crawldb and it contains n
outlinks.

t = 0 - Generate runs, A is generated.

t = 1 - Page A is fetched and its cash is distributed to its outlinks.

t = 2 - Generate runs, pages P0-Pn are generated.

t = 3 - P0-Pn are fetched and their cash is distributed to their outlinks.
- At this time, it is possible that page Pk links to page A.
So, now page A's cash > 0.

t = 4 - Generate runs, page A is considered but is not generated
(since its next fetch time is later than current time).
- Won't page A become a temporary sink? The time between
subsequent fetches may be as large as 30 days in the default
configuration. So, page A will accumulate cash for a long time without
distributing it.
- I don't see how we can achieve this, but, IMO, if a page is
considered but not generated, nutch should distribute its cash to
the outlinks that are stored in its parse data. (I know that
this is incredibly hard, if not impossible, to do.)

Or am I missing something here?



--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Doğacan Güney


[jira] Resolved: (NUTCH-510) IndexMerger delete working dir

2007-07-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney resolved NUTCH-510.
-

Resolution: Fixed
  Assignee: Doğacan Güney

Committed in rev. 555307 with style modifications. I also removed two useless 
log guards in IndexMerger.

 IndexMerger delete working dir
 --

 Key: NUTCH-510
 URL: https://issues.apache.org/jira/browse/NUTCH-510
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: index.merger.delete.temp.dirs.patch


 IndexMerger does not delete the working dir when an IOException such as 
 "No space left on device" is thrown. Local temporary directories should be 
 deleted. 
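
The shape of the fix is presumably a try/finally around the merge (a sketch
under that assumption; the method and path names are made up, and the exact
cleanup call in the committed patch may differ):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeCleanupSketch {
  public static void merge(Configuration conf, Path workDir) throws IOException {
    FileSystem localFs = FileSystem.getLocal(conf);
    try {
      // ... copy the indexes to workDir and run the Lucene merge ...
    } finally {
      localFs.delete(workDir); // runs even on "No space left on device"
    }
  }
}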

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-510) IndexMerger delete working dir

2007-07-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-510.
---


Issue resolved and committed.

 IndexMerger delete working dir
 --

 Key: NUTCH-510
 URL: https://issues.apache.org/jira/browse/NUTCH-510
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: index.merger.delete.temp.dirs.patch


 IndexMerger does not delete the working dir when an IOException such as 
 "No space left on device" is thrown. Local temporary directories should be 
 deleted. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: OPIC scoring differences

2007-07-11 Thread Andrzej Bialecki

Doğacan Güney wrote:


Andrzej, nice to see you working on this.

There is one thing that I don't understand about your presentation.
Assume that page A is the only url in our crawldb and it contains n
outlinks.

t = 0 - Generate runs, A is generated.

t = 1 - Page A is fetched and its cash is distributed to its outlinks.

t = 2 - Generate runs, pages P0-Pn are generated.

t = 3 - P0-Pn are fetched and their cash is distributed to their 
outlinks.

- At this time, it is possible that page Pk links to page A.
So, now page A's cash > 0.

t = 4 - Generate runs, page A is considered but is not generated
(since its next fetch time is later than current time).
- Won't page A become a temporary sink? The time between
subsequent fetches may be as large as 30 days in the default
configuration. So, page A will accumulate cash for a long time without
distributing it.


Yes. That's why Abiteboul used history (several cycles long) to smooth 
out temporary imbalances in cash redistribution. The history component 
described in the paper can be either several cycles long or a specific 
period of time long.


In our case I think the history for rarely updated pages should span the 
db.max.interval period plus some, and for frequently updated pages it 
should span several cycles.
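
A rough sketch of what such smoothing could look like (my assumption, not
Abiteboul's exact estimator): keep a window of per-cycle cash totals and
report the windowed average instead of the raw accumulated value.

import java.util.LinkedList;

public class HistorySmoothingSketch {
  private final LinkedList<Double> history = new LinkedList<Double>();
  private final int window; // a few cycles for hot pages, db.max.interval-based
                            // for rarely updated ones, as suggested above

  public HistorySmoothingSketch(int window) {
    this.window = window;
  }

  // Record the cash collected in one cycle; return the smoothed score.
  public double update(double cashThisCycle) {
    history.addLast(cashThisCycle);
    if (history.size() > window) {
      history.removeFirst();
    }
    double sum = 0;
    for (double c : history) {
      sum += c;
    }
    return sum / history.size(); // average over the recent window
  }
}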



- I don't see how we can achieve this, but, IMO, if a page is
considered but not generated, nutch should distribute its cash to
the outlinks that are stored in its parse data. (I know that
this is incredibly hard, if not impossible, to do.)


Actually we store outlinks in two places - one place is obviously the 
segments. The other less obvious place is the linkdb - the data is 
there, it just needs to be inverted (again).


So, theoretically, we could modify the updatedb process to consider the 
complete webgraph, i.e. all link information collected so far - but the 
main attraction of OPIC is that it's incremental, so that you don't 
have to process the whole webgraph for small incremental updates.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com