[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

2009-01-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666045#action_12666045
 ] 

Hudson commented on NUTCH-579:
--

Integrated in Nutch-trunk #701 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/701/])
 - Feed plugin only indexes one post per feed due to identical digest


 Feed plugin only indexes one post per feed due to identical digest
 --

 Key: NUTCH-579
 URL: https://issues.apache.org/jira/browse/NUTCH-579
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
Reporter: Joseph Chen
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: NUTCH-579.patch


 When parsing an RSS feed, only one post will be indexed per feed.  The reason 
 for this is that the digest, which is calculated from the content (or from 
 the URL if the content is null), is always the same for each post in a feed.
 I noticed this when I was examining my Lucene indexes using Luke.  All of the 
 individual feed entries were being indexed properly, but when the dedup 
 step ran, my merged index ended up with only one document.
 As a quick fix, I simply overrode the digest in FeedIndexingFilter.java by 
 adding the following code to the filter function:
 byte[] signature = MD5Hash.digest(url.toString()).getDigest();
 doc.removeField("digest");
 doc.add(new Field("digest", StringUtil.toHexString(signature), 
 Field.Store.YES, Field.Index.NO));
 This seems to fix the issue, as the index now contains the proper number of 
 documents.
 Does anyone have comments on whether this is a good solution, or whether 
 there is a better one?
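To see why dedup collapses the whole feed into one document, here is a small standalone sketch (plain JDK only, not actual Nutch code; the feed content and URLs are made up) showing that an MD5 digest over the shared raw feed content is identical for every post, while a digest over each post's URL, as in the quick fix above, is not:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DigestDemo {
    // Hex MD5 of the given bytes, analogous to what MD5Hash.digest(...) produces.
    static String md5Hex(byte[] data) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(data);
        return String.format("%032x", new BigInteger(1, d));
    }

    public static void main(String[] args) throws Exception {
        // Every post parsed out of a feed shares the same fetched content:
        // the entire feed document.
        byte[] feedContent = "<rss>...entire feed...</rss>".getBytes(StandardCharsets.UTF_8);

        String digestPost1 = md5Hex(feedContent);
        String digestPost2 = md5Hex(feedContent);
        // Identical digests, so dedup keeps only one of the documents.
        System.out.println(digestPost1.equals(digestPost2)); // true

        // Hashing each post's URL instead yields distinct digests per post.
        String byUrl1 = md5Hex("http://example.com/feed#post1".getBytes(StandardCharsets.UTF_8));
        String byUrl2 = md5Hex("http://example.com/feed#post2".getBytes(StandardCharsets.UTF_8));
        System.out.println(byUrl1.equals(byUrl2)); // false
    }
}
```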

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

2009-01-09 Thread Yury (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662490#action_12662490
 ] 

Yury commented on NUTCH-579:


Hi!

I have the same problem with the feed parser. I crawl a LiveJournal feed and 
FeedParser parses it. The ParseResult contains all items of the channel, but 
the index contains only the channel header. Unfortunately, Joseph's solution 
doesn't work for me: FeedIndexingFilter processes only the channel header.




[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

2007-12-18 Thread Joseph Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552935
 ] 

Joseph Chen commented on NUTCH-579:
---

I changed the db.signature.class and this seems to solve the problem when I 
first do a crawl.

Now I'm seeing a similar problem when I try to merge the results of two 
crawls.  I performed two separate crawls using the crawl tool and wanted to 
merge their results.  Here are the steps I took:

1) Merged the segments from the two crawls
2) Inverted links
3) Merged the crawldb
4) Indexed the segments
5) Deduped the index
6) Merged the indexes
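For reference, the steps above roughly correspond to this Nutch 1.x command sequence. The paths are hypothetical and the exact flags are from the 1.0-era CLI, so double-check them against the usage printed by bin/nutch for your version:

```shell
# 1) Merge the segments from the two crawls into one segments dir
bin/nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*

# 2) Invert links over the merged segments
bin/nutch invertlinks merged/linkdb -dir merged/segments

# 3) Merge the crawldbs
bin/nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb

# 4) Index the merged segments
bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*

# 5) Dedup the newly built indexes
bin/nutch dedup merged/indexes

# 6) Merge the part indexes into a single index
bin/nutch merge merged/index merged/indexes
```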

I noticed a problem after running the dedup.  My original index had about 8000 
documents (corresponding to feed posts) and after merging I ended up with about 
half that number (4000 documents).

Examining the index via Luke shows that I'm back down to one post per feed; 
each remaining document has a unique digest value.
When I skip the dedup step (step 5), the number of documents is around 17000, 
and examining this index shows multiple posts from a feed.

I searched for db.signature.class in DeleteDuplicates.java, the class that 
gets invoked by bin/nutch dedup, but I didn't see any references to it.

Any ideas about this issue?




[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

2007-11-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544383
 ] 

Doğacan Güney commented on NUTCH-579:
-

Joseph, good point. The parse-feed plugin is meant to be used with 
TextProfileSignature or any other Signature implementation that uses the 
parse text and ignores the raw content (since all posts share the same 
content, the feed itself, but have different text, they will all get 
different signatures).

A possible fix may be to change MD5Signature to hash the content together 
with the parse text. That way, posts in a feed will have different 
signatures, but MD5Signature's behaviour will stay approximately the same.

Anyway, for now, you can just change the db.signature.class option.
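Concretely, switching the signature implementation is an override in conf/nutch-site.xml. A sketch of what that property might look like, assuming TextProfileSignature lives in org.apache.nutch.crawl as in Nutch 1.x (verify the property name and class against your nutch-default.xml):

```xml
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>Compute page signatures from the parse text rather than
  the raw content, so posts within one feed get distinct digests.</description>
</property>
```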
