Avoid parsing unnecessary links and get a more relevant outlink list
Key: NUTCH-488
URL: https://issues.apache.org/jira/browse/NUTCH-488
Project: Nutch
Issue Type:
[
https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-488:
Attachment: DOMContentUtils.patch
Avoid parsing unnecessary links and get a more relevant outlink
[
https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-489:
Attachment: SuffixURLFilter.java.patch
suffix-urlfilter.txt.patch
URLFilter-suffix
[
https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-489:
Attachment: SuffixURLFilter_v2.java.patch
My mistake...
I've added a new patch which is supposed
Add hadoop masters configuration file into conf folder
--
Key: NUTCH-500
URL: https://issues.apache.org/jira/browse/NUTCH-500
Project: Nutch
Issue Type: Improvement
Components:
[
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506922
]
Emmanuel Joke commented on NUTCH-503:
-
I just tried your patch and I'm afraid I still have the same issue.
[
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507469
]
Emmanuel Joke commented on NUTCH-503:
-
Sorry, my mistake.
My compiled jar was not correctly included in my
[
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509039
]
Emmanuel Joke commented on NUTCH-503:
-
Results seem to be good. So I'm wondering if it is possible to commit this
${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker
--
Key: NUTCH-508
URL: https://issues.apache.org/jira/browse/NUTCH-508
Project: Nutch
Update Crawldb: avoid to start a job if there is no valid segment
-
Key: NUTCH-509
URL: https://issues.apache.org/jira/browse/NUTCH-509
Project: Nutch
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-509:
Attachment: crawldb.patch
In this patch, I've added a simple boolean to start the job only if we have
[
https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511038
]
Emmanuel Joke commented on NUTCH-509:
-
You're right. In this case, I will close the JIRA
Update Crawldb: avoid
[
https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke closed NUTCH-509.
---
Resolution: Won't Fix
As explained by Doğacan, the Crawldb update behaves well. This patch is
[
https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-516:
Attachment: NUTCH-516.patch
I fixed the issue by changing the FetchTime in
Use URLValidator in the Injector
Key: NUTCH-522
URL: https://issues.apache.org/jira/browse/NUTCH-522
Project: Nutch
Issue Type: Improvement
Components: injector
Reporter: Emmanuel Joke
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: NUTCH-522_v2.patch
Oops, my mistake. Please find an updated patch.
Actually I've a
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514153
]
Emmanuel Joke commented on NUTCH-522:
-
Actually I tried to fetch the url
[
https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-526:
Attachment: NUTCH-526.patch
patch provided
Use a combiner in LinkDbMerger to improve the
Use a combiner in LinkDbMerger to improve the performance as in LinkDb
-
Key: NUTCH-526
URL: https://issues.apache.org/jira/browse/NUTCH-526
Project: Nutch
Issue Type:
[
https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-528:
Attachment: NUTCH-528.patch
patch attached
CrawlDbReader: add some new stats + dump into a csv
CrawlDbReader: add some new stats + dump into a csv format
--
Key: NUTCH-528
URL: https://issues.apache.org/jira/browse/NUTCH-528
Project: Nutch
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-529:
Attachment: NUTCH-529.patch
patch attached
NodeWalker.skipChildren doesn't work for more than 1
NodeWalker.skipChildren doesn't work for more than 1 child.
-
Key: NUTCH-529
URL: https://issues.apache.org/jira/browse/NUTCH-529
Project: Nutch
Issue Type: Bug
Reporter:
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: NUTCH-522_v3.patch
Use URLValidator in the Injector
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: NUTCH-522_v3.patch
commons-validator's UrlValidator does not filter URLs containing spaces.
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: (was: NUTCH-522_v3.patch)
Use URLValidator in the Injector
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516138
]
Emmanuel Joke commented on NUTCH-522:
-
I tried with protocol-http and protocol-httpclient; I got the same error
[
https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516209
]
Emmanuel Joke commented on NUTCH-526:
-
Actually I ran a simple test on 2 small linkdbs, and I didn't see any
Add a combiner to improve performance on updatedb
-
Key: NUTCH-530
URL: https://issues.apache.org/jira/browse/NUTCH-530
Project: Nutch
Issue Type: Improvement
Environment: java 1.6
[
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-530:
Attachment: NUTCH-530.patch
Patch provided.
It reduced the process time by 20%.
Output from the
CrawlDbMerger: wrong computation of last fetch time
---
Key: NUTCH-532
URL: https://issues.apache.org/jira/browse/NUTCH-532
Project: Nutch
Issue Type: Bug
Reporter: Emmanuel Joke
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532.patch
Patch provided.
CrawlDbMerger: wrong computation of last fetch time
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: (was: NUTCH-532.patch)
CrawlDbMerger: wrong computation of last fetch time
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532.patch
CrawlDbMerger: wrong computation of last fetch time
LinkDbMerger: normalized URL is not updated in the key and inlinks list
---
Key: NUTCH-533
URL: https://issues.apache.org/jira/browse/NUTCH-533
Project: Nutch
Issue Type:
[
https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-533:
Attachment: NUTCH-533.patch
Patch provided
LinkDbMerger: normalized URL is not updated in the key
[
https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516358
]
Emmanuel Joke commented on NUTCH-526:
-
Could you please wait a few more days?
I would like to wait for a
[
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516602
]
Emmanuel Joke commented on NUTCH-530:
-
I'm sure to follow your point regarding the outlinks number.
I don't
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516618
]
Emmanuel Joke commented on NUTCH-532:
-
res.getFetchTime() - Math.round(res.getFetchInterval() * 1000d); always
[
https://issues.apache.org/jira/browse/NUTCH-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-534:
Attachment: NUTCH-534.patch
Patch provided
SegmentMerger: add -normalize option
SegmentMerger: add -normalize option
Key: NUTCH-534
URL: https://issues.apache.org/jira/browse/NUTCH-534
Project: Nutch
Issue Type: Improvement
Reporter: Emmanuel Joke
Assignee:
[
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516675
]
Emmanuel Joke commented on NUTCH-530:
-
Actually I don't re-use CrawlDbReducer, I've defined a new class as
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532_v2.patch
New patch provided
* Add new method to CrawlDatum
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532_v3.patch
My mistake, actually I'm not really familiar with the VERSION.
I
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532_v4.patch
I updated the code following Andrzej's comments. I've also updated the
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532-test.patch
Please find a patch which fixes the JUnit test.
CrawlDbMerger:
[
https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke closed NUTCH-526.
---
Resolution: Won't Fix
No improvement.
Use a combiner in LinkDbMerger to improve the performance as
[
https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-528:
Attachment: NUTCH-528_v2.patch
New patch provided. It includes the new options as requested by DG.
[
https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-529:
Attachment: TestNodeWalker.java
JUnit test provided.
NodeWalker.skipChildren doesn't work for
Move URLNormalizer from Outlink to ParseOutputFormat
Key: NUTCH-548
URL: https://issues.apache.org/jira/browse/NUTCH-548
Project: Nutch
Issue Type: Improvement
Components:
[
https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-548:
Attachment: NUTCH-548.patch
Patch provided
Move URLNormalizer from Outlink to ParseOutputFormat
[
https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524669
]
Emmanuel Joke commented on NUTCH-548:
-
Actually I've one comment/question. I noticed that we normalize and filter
[
https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525452
]
Emmanuel Joke commented on NUTCH-548:
-
My mistake, you're right. I was using the crawl command to run my test,
[
https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-529:
Attachment: TestNodeWalker.java
Another version without dependency to Neko.
[
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528729
]
Emmanuel Joke commented on NUTCH-557:
-
Did you notice any difference in terms of performance? Improvement or
[
https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-529:
Attachment: (was: TestNodeWalker.java)
NodeWalker.skipChildren doesn't work for more than 1
[
https://issues.apache.org/jira/browse/NUTCH-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532552
]
Emmanuel Joke commented on NUTCH-508:
-
It is Mathijs Homminga
${hadoop.log.dir} and ${hadoop.log.file} are not
[
https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-548:
Attachment: NUTCH-548.patch.v2
New patch which removes unused parameter and fixes the plugin parser
Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
-
Key: NUTCH-592
URL: https://issues.apache.org/jira/browse/NUTCH-592
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-592:
Attachment: patch.txt
Patch provided.
Fetcher2 : NPE for page with status
[
https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554555
]
Emmanuel Joke commented on NUTCH-528:
-
I'm wondering if somebody could review this patch and eventually commit it
[
https://issues.apache.org/jira/browse/NUTCH-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554572
]
Emmanuel Joke commented on NUTCH-534:
-
Hi Andrzej, would you mind reviewing this patch too and giving us your
[
https://issues.apache.org/jira/browse/NUTCH-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554571
]
Emmanuel Joke commented on NUTCH-595:
-
I had a similar issue and I followed the instructions given by Dennis and it
[
https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-528:
Attachment: NUTCH-528_v3.patch
New patch provided following Andrzej's recommendations:
??*
ParseSegments parses content even if it's not CrawlDatum.STATUS_FETCH_SUCCESS
---
Key: NUTCH-596
URL: https://issues.apache.org/jira/browse/NUTCH-596
Project: Nutch
Issue
Remove deprecated use of ToolBase, Migration to the new implementation
--
Key: NUTCH-598
URL: https://issues.apache.org/jira/browse/NUTCH-598
Project: Nutch
Issue Type:
[
https://issues.apache.org/jira/browse/NUTCH-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-598:
Attachment: NUTCH-598.patch
Patch provided
It includes:
- remove ToolBase call and move to the new
[
https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555840#action_12555840
]
Emmanuel Joke commented on NUTCH-559:
-
Dogacan, is there any chance that you commit this
[
https://issues.apache.org/jira/browse/NUTCH-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555841#action_12555841
]
Emmanuel Joke commented on NUTCH-580:
-
I've been using your patch for a while now and it
[
https://issues.apache.org/jira/browse/NUTCH-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555843#action_12555843
]
Emmanuel Joke commented on NUTCH-531:
-
It looks like this issue has been solved with the
[
https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555840#action_12555840
]
jokeout edited comment on NUTCH-559 at 1/4/08 1:55 AM:
-
Dogacan,
[
https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555890#action_12555890
]
Emmanuel Joke commented on NUTCH-596:
-
I agree with you; the proper solution will be the
[
https://issues.apache.org/jira/browse/NUTCH-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-598:
Attachment: NUTCH-598.v2.patch
Thanks Dogacan for your update.
New patch provided. Most of the
[
https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557969#action_12557969
]
Emmanuel Joke commented on NUTCH-528:
-
Andrzej, did you have the time to review my new
[
https://issues.apache.org/jira/browse/NUTCH-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557968#action_12557968
]
Emmanuel Joke commented on NUTCH-534:
-
Andrzej, do you think it will be possible to
[
https://issues.apache.org/jira/browse/NUTCH-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559378#action_12559378
]
Emmanuel Joke commented on NUTCH-363:
-
FYI, the operation to normalize links within the
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566950#action_12566950
]
Emmanuel Joke commented on NUTCH-567:
-
Hi Dogacan, do you think you will commit this new
[
https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567693#action_12567693
]
Emmanuel Joke commented on NUTCH-596:
-
I didn't find any useful information in the
[
https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571874#action_12571874
]
Emmanuel Joke commented on NUTCH-613:
-
I have the same analysis. I just changed my local
[
https://issues.apache.org/jira/browse/NUTCH-598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571875#action_12571875
]
Emmanuel Joke commented on NUTCH-598:
-
Hi Dogacan, did you finish reviewing my patch? Is
[
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-578:
Attachment: NUTCH-578.patch
I've got the same error for a page with an HTTP status code of 503.
I
[
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-578:
Attachment: NUTCH-578_v2.patch
Actually I just realised that the setPageRetrySchedule in
[
https://issues.apache.org/jira/browse/NUTCH-615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-615:
Attachment: NUTCH-615.patch
Redirected URLs are fetched without setting any FetchInterval
Redirected URLs are fetched without setting any FetchInterval
Key: NUTCH-615
URL: https://issues.apache.org/jira/browse/NUTCH-615
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-616:
Attachment: NUTCH-616.patch
Patch provided
Reset Fetch Retry counter when fetch is successful
Reset Fetch Retry counter when fetch is successful
--
Key: NUTCH-616
URL: https://issues.apache.org/jira/browse/NUTCH-616
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
[
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579285#action_12579285
]
Emmanuel Joke commented on NUTCH-530:
-
OK
Add a combiner to improve performance on
87 matches