nutch is losing not-modified pages
Hi, in the Fetcher at line 192, in case the status is NOTMODIFIED we collect null as the content, even though we already have the content. I'm worried about what happens to a page that does not change for 60 days, since the concept in Nutch is to delete segments that are older than db.default.fetch.interval, isn't it? If this is true, maybe someone with write access can change null to content. Thanks for any comments. Stefan
Re: nutch is losing not-modified pages
Stefan Groschupf wrote: Hi, in the Fetcher at line 192, in case the status is NOTMODIFIED we collect null as the content, even though we already have the content. I'm worried about what happens to a page that does not change for 60 days, since the concept in Nutch is to delete segments that are older than db.default.fetch.interval, isn't it? If this is true, maybe someone with write access can change null to content. This requires a more systematic approach, which is part of the adaptive fetch patch. In that patch, pages which are older than the maximum fetch interval (a system-wide setting) will be forced onto the fetchlist, no matter what their state. This also ensures that pages in the GONE state are checked from time to time. I'll be working on this patch next week, with the goal of committing it, and I could use some testing and code review then... -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
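The forced-refetch rule Andrzej describes can be sketched roughly as follows. The class and parameter names here are hypothetical, for illustration only - this is not the actual adaptive fetch patch code:

```java
// Illustrative sketch (not the real patch): a page is due for fetching when
// its own fetch interval has elapsed, and is force-fetched regardless of its
// state (e.g. GONE) once it is older than a system-wide maximum interval.
class FetchScheduler {
    static boolean shouldFetch(long lastFetchTime, long fetchInterval,
                               long maxInterval, long now) {
        long age = now - lastFetchTime;
        if (age > maxInterval) return true; // force onto fetchlist, any state
        return age > fetchInterval;         // normal re-fetch schedule
    }
}
```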
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ] Dawid Weiss commented on NUTCH-134: --- (back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a plain-text-only summarizer is ideal for clustering, for example. HTML is quite uncomfortable to work with. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134 Project: Nutch Type: Bug Components: searcher Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev Reporter: Andrzej Bialecki Attachments: summarizer.060506.patch Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using a Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts which score equally high, only the first of them will be retained, and the rest of the equally-scoring excerpts will be discarded in favor of other excerpts (possibly lower-scoring). To fix this, the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order, the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
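The fix described in the issue (replace the Set with a List + sort, and restore document order via an int order field) can be sketched as follows. The Excerpt fields and the selector class are illustrative stand-ins, not the actual Nutch Summarizer API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Nutch's Summarizer excerpt; field names are
// illustrative only.
class Excerpt {
    final String text;
    final int numUniqueTokens; // score: distinct query terms in the excerpt
    final int order;           // position in the original document

    Excerpt(String text, int numUniqueTokens, int order) {
        this.text = text;
        this.numUniqueTokens = numUniqueTokens;
        this.order = order;
    }
}

class ExcerptSelector {
    // Keep the top-n excerpts by score WITHOUT dropping equal-scoring ones
    // (a Set whose comparator compares only numUniqueTokens treats equal
    // scores as duplicates), then restore original document order.
    static List<Excerpt> select(List<Excerpt> candidates, int n) {
        List<Excerpt> sorted = new ArrayList<>(candidates);
        // stable sort: ties keep their original relative order
        sorted.sort(Comparator.comparingInt((Excerpt e) -> e.numUniqueTokens).reversed());
        List<Excerpt> top = new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
        top.sort(Comparator.comparingInt(e -> e.order));
        return top;
    }
}
```

With two equally-scoring excerpts and n = 2, both survive selection, which is exactly the case the Set-based logic gets wrong.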
Re: http chunked content
As far as I know, a lot of HTTP servers respond with chunked content - at least all that return dynamically generated pages. Should I file a bug? Any thoughts? In fact, the requests issued from the http plugin are HTTP 1.0, so the servers should never return chunked content. I think that readChunkedContent was included in the code for future use. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
[jira] Created: (NUTCH-265) Getting Clustered results in better form.
Getting Clustered results in better form. - Key: NUTCH-265 URL: http://issues.apache.org/jira/browse/NUTCH-265 Project: Nutch Type: Improvement Components: searcher Versions: 0.7.2 Reporter: Kris K The cluster results currently come with a title and a link to the URL. As an improvement, they should be clustered keyword phrases (like Vivisimo's). Anyone can share their views on it.
Re: Merging segments
Chris Fellows wrote: Hello, So the last discussion on merging segments was back in Jan. Has there been any progress in this direction? What would be the benefit of being able to merge segments? Would being able to merge segments open up new functionality options, or is merging just a convenience? Also, what's the estimate for how involved merge functionality development is? Relief is on the way. Fine folks at houxou.com have sponsored the development of a brand-new SegmentMerger + slicer, and decided to donate it to the project - big thanks! I'm running some final tests, and will commit it today/tomorrow. -- Best regards, Andrzej Bialecki
[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12378425 ] Dawid Weiss commented on NUTCH-265: --- The clustering interface is very simple in Nutch because it usually needs to be adjusted to the needs of a particular application. Maintaining a complex user interface is not among Nutch's objectives, so I doubt it's possible. Carrot2, which Nutch uses internally, has a JavaScript-powered interface which could be added to Nutch if there are folks who really think it is worth the effort. See this one: http://carrot.cs.put.poznan.pl/carrot2-remote-controller/newsearch.do?query=nutch&processingChain=carrot2.process.lingo-yahooapi&resultsRequested=100 Getting Clustered results in better form. - Key: NUTCH-265 URL: http://issues.apache.org/jira/browse/NUTCH-265
Re: http chunked content
I'm almost sure that this is not related to HTTP 1.0 requests. On 08.05.2006 at 03:20, Jérôme Charron wrote: In fact, the requests issued from the http plugin are HTTP 1.0, so the servers should never return chunked content. I think that readChunkedContent was included in the code for future use. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] Doug Cutting commented on NUTCH-134: +1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String. I don't think this has bad performance implications. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134
Re: http chunked content
http://www.apple.com, for example, answers with chunked content even if you request with an HTTP 1.0 header. On 08.05.2006 at 03:20, Jérôme Charron wrote: In fact, the requests issued from the http plugin are HTTP 1.0, so the servers should never return chunked content. I think that readChunkedContent was included in the code for future use. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Merging segments
That's great. Well, my follow-up to that then is: will the new tool allow any form of diff'ing segments? In practice this would allow you to run a crawl on a series of sites one week, then run another crawl on the same sites a week or so later, diff the segments, and allow users to search on changes within the search domain. --- Andrzej Bialecki wrote: Relief is on the way. Fine folks at houxou.com have sponsored the development of a brand-new SegmentMerger + slicer, and decided to donate it to the project - big thanks! I'm running some final tests, and will commit it today/tomorrow.
Re: Merging segments
Chris Fellows wrote: That's great. Well, my follow-up to that then is: will the new tool allow any form of diff'ing segments? No, it does only two things - merging and slicing. That's already one too many... ;) Chris Fellows wrote: In practice this would allow you to run a crawl on a series of sites one week, then run another crawl on the same sites a week or so later, diff the segments, and allow users to search on changes within the search domain. Interesting concept, but I think it would be better implemented as a variant of de-duplication, rather than segment content manipulation. -- Best regards, Andrzej Bialecki
Re: http chunked content
Furthermore, we can read in the HTTP/1.1 specification that a server MUST NOT send transfer-codings to an HTTP/1.0 client. I once did a socket implementation against Anonymizer. This is a well-established proxy service that services $100K+ government and private contracts. Their server always sent chunked content despite all headers. I'm pretty sure that there are other well-established servers that send chunked content despite the RFC. I'm guessing that it might have something to do with wanting to control content compression. All the browsers can handle it, and that's probably all Apple is concerned with - even though they're overriding an RFC spec requirement. Chris --- Jérôme Charron wrote: http://www.apple.com, for example, answers with chunked content even if you request with an HTTP 1.0 header. Stefan, I don't see any Transfer-Encoding: chunked header in responses from www.apple.com. Furthermore, we can read in the HTTP/1.1 specification that a server MUST NOT send transfer-codings to an HTTP/1.0 client. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: http chunked content
Just remembered - I got around it by using HttpClient, which handles reading the response (chunked or not) transparently. I haven't looked at the Nutch code, but if we were to use HttpClient 3.0.x or later, it should take care of it.
[jira] Created: (NUTCH-266) hadoop bug when doing updatedb
hadoop bug when doing updatedb -- Key: NUTCH-266 URL: http://issues.apache.org/jira/browse/NUTCH-266 Project: Nutch Type: Bug Versions: 0.8-dev Environment: windows xp, JDK 1.4.2_04 Reporter: Eugen Kochuev I constantly get the following error message:
060508 230637 Running job: job_pbhn3t
060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
060508 230637 job_pbhn3t
java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
    at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
    at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
    at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)
Re: http chunked content
Chris Fellows wrote: Just remembered - I got around it by using HttpClient, which handles reading the response (chunked or not) transparently. I haven't looked at the Nutch code, but if we were to use HttpClient 3.0.x or later, it should take care of it. Take a look at protocol-httpclient. This discussion is on whether/how to fix protocol-http. The other plugin already supports this. -- Best regards, Andrzej Bialecki
[jira] Closed: (NUTCH-264) Tools for merging and filtering CrawlDb-s and LinkDb-s
[ http://issues.apache.org/jira/browse/NUTCH-264?page=all ] Andrzej Bialecki closed NUTCH-264: --- Resolution: Fixed A version of this patch was included in rev. 405183 Tools for merging and filtering CrawlDb-s and LinkDb-s -- Key: NUTCH-264 URL: http://issues.apache.org/jira/browse/NUTCH-264 Project: Nutch Type: New Feature Versions: 0.8-dev Reporter: Andrzej Bialecki Attachments: patch.txt This patch contains implementations and unit tests for two new commands: * mergedb: merges one or more CrawlDb-s, optionally filtering urls through the current URLFilters. * mergelinkdb: as above, only for LinkDb-s. Optional filtering is applied both to toUrls and fromUrls in Inlinks.
[jira] Closed: (NUTCH-263) MapWritable.equals() doesn't work properly
[ http://issues.apache.org/jira/browse/NUTCH-263?page=all ] Andrzej Bialecki closed NUTCH-263: --- Resolution: Fixed Patch applied in rev. 405179. If further improvements are needed, please re-open this issue. MapWritable.equals() doesn't work properly -- Key: NUTCH-263 URL: http://issues.apache.org/jira/browse/NUTCH-263 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Andrzej Bialecki Attachments: patch1.txt MapWritable.equals() is sensitive to the order in which map entries have been created. E.g. this fails but it should succeed: MapWritable map1 = new MapWritable(); MapWritable map2 = new MapWritable(); map1.put(new UTF8("key1"), new UTF8("val1")); map1.put(new UTF8("key2"), new UTF8("val2")); map2.put(new UTF8("key2"), new UTF8("val2")); map2.put(new UTF8("key1"), new UTF8("val1")); assertTrue(map1.equals(map2)); Users expect that this should not be the case, i.e. this class should follow the same rules as Map.equals() (returns true if the given object is also a map and the two maps represent the same mappings).
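The Map.equals() semantics the issue asks for can be illustrated with a simplified stand-in class (this is not the real MapWritable, just a sketch of the contract: equals() compares the set of mappings and ignores insertion order):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for MapWritable, showing the expected contract:
// two maps are equal iff they represent the same mappings, regardless of
// the order in which entries were put().
class OrderInsensitiveMap {
    private final Map<String, String> entries = new HashMap<>();

    void put(String key, String value) {
        entries.put(key, value);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof OrderInsensitiveMap)) return false;
        // Delegate to Map.equals(), which is order-insensitive by contract.
        return entries.equals(((OrderInsensitiveMap) other).entries);
    }

    @Override
    public int hashCode() {
        // Keep hashCode consistent with equals, as the Object contract requires.
        return entries.hashCode();
    }
}
```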
Re: http chunked content
Okay, saw the code in the http-protocol plugin. I remember looking at this about a year ago. RFC 2616 (HTTP/1.1) does say, as Jérôme pointed out: A server MUST NOT send transfer-codings to an HTTP/1.0 client. Regardless, I can attest that there are servers out there that return chunked content regardless of the client. We had a socket implementation akin to the HttpResponse.java in the http-protocol plugin and were stumped on how to identify whether the response was chunked or not - as we could not reliably use the Transfer-Encoding header. The only way we could see was trying to use the initial hex characters denoting the size of the first chunk. The chunk-size field is a string of hex digits indicating the size of the chunk. The chunked encoding is ended by any chunk whose size is zero, followed by the trailer, which is terminated by an empty line. - more from RFC 2616 But in practice this was error-prone. Switching over to Apache HttpClient eliminated this problem, as it transparently handles chunked and un-chunked content. But HttpClient is much more heavyweight, and so the conversion could only be done after implementing some basic resource pooling on the primary HttpClient object. It does look like this would be a serious refactoring job, as Nutch uses java.net classes throughout. On the other hand, it might simplify some areas of the Nutch protocol classes, and HttpClient does have some interesting built-in support for multi-threading/performance tuning of requests. I hope this helps towards a solution. Best Regards, Chris --- Andrzej Bialecki wrote: Take a look at protocol-httpclient. This discussion is on whether/how to fix protocol-http. The other plugin already supports this.
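For reference, the chunk framing Chris quotes from RFC 2616 (hex chunk-size line, chunk data, zero-size last-chunk, trailer terminated by an empty line) can be decoded in a few lines. This is a minimal in-memory sketch that works on a String for clarity; a real decoder (like the plugin's readChunkedContent) must work on bytes, handle chunk extensions fully, and guard against malformed or truncated input:

```java
// Minimal sketch of HTTP/1.1 chunked transfer-decoding for a message body
// already held in memory. Not production code: no bounds checking, and it
// treats the body as text rather than bytes.
class ChunkedDecoder {
    static String decode(String raw) {
        StringBuilder body = new StringBuilder();
        int pos = 0;
        while (true) {
            int eol = raw.indexOf("\r\n", pos);
            String sizeLine = raw.substring(pos, eol);
            // chunk-size is hex; a ';' may introduce chunk extensions
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16);
            pos = eol + 2;                     // skip the CRLF after the size
            if (size == 0) break;              // last-chunk; trailers follow
            body.append(raw, pos, pos + size); // chunk-data
            pos += size + 2;                   // skip data and its CRLF
        }
        return body.toString();
    }
}
```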
[jira] Created: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value
Indexer doesn't consider linkdb when calculating boost value Key: NUTCH-267 URL: http://issues.apache.org/jira/browse/NUTCH-267 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Chris Schneider Priority: Minor Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if indexer.boost.by.link.count was true, the indexer boost value was scaled based on the log of the number of inbound links: if (boostByLinkCount) res *= (float)Math.log(Math.E + linkCount); This is no longer true (even before Andrzej implemented scoring filters). Instead, the boost value is just the square root (or some other scorePower) of the page score. Shouldn't the invertlinks command, which creates the linkdb, have some effect on the boost value calculated during indexing (either via the OPICScoringFilter or some other built-in filter)?
[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] Doug Cutting commented on NUTCH-267: The OPIC score is much like a count of incoming links, but a bit more refined. OPIC(P) is one plus the sum of the OPIC contributions for all links to a page. The OPIC contribution of a link from page P is OPIC(P) / numOutLinks(P). Indexer doesn't consider linkdb when calculating boost value Key: NUTCH-267 URL: http://issues.apache.org/jira/browse/NUTCH-267
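Doug's contribution rule can be illustrated with a toy single propagation step: each page passes score(P)/numOutLinks(P) along every outlink, and a target's new score is one plus the sum of incoming contributions. The `OpicStep` class below is illustrative only; it assumes every page in the graph has a score, and it is not how Nutch's OPICScoringFilter is actually implemented:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of one OPIC-style propagation step over a tiny link graph.
class OpicStep {
    // graph: page -> list of its outlinks; scores: current page scores.
    // Assumes every key in graph also appears in scores.
    static Map<String, Double> propagate(Map<String, List<String>> graph,
                                         Map<String, Double> scores) {
        Map<String, Double> next = new HashMap<>();
        for (String p : scores.keySet()) next.put(p, 1.0); // the "one plus"
        for (Map.Entry<String, List<String>> e : graph.entrySet()) {
            // each outlink carries an equal share of the source's score
            double share = scores.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue()) {
                next.merge(target, share, Double::sum);
            }
        }
        return next;
    }
}
```

For a graph A -> {B, C}, B -> {C} with all scores starting at 1.0, one step gives B = 1 + 1/2 = 1.5 and C = 1 + 1/2 + 1 = 2.5, showing how a page's score reflects both how many pages link to it and how selective those linkers are.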