[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860114#comment-13860114 ] Markus Jelsma commented on NUTCH-1325: -- Hi Tejas - I think most seems fine now and i

[jira] [Commented] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860121#comment-13860121 ] Markus Jelsma commented on NUTCH-1080: -- +1! Type safe members , arguments for

[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860124#comment-13860124 ] Markus Jelsma commented on NUTCH-1670: -- +1 set same crawldb directory in mergedb

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860140#comment-13860140 ] Markus Jelsma commented on NUTCH-1360: -- Almost all unit tests fail due to improper

[jira] [Reopened] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-1360: -- Suport the storing of IP address connected to when web crawling

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860142#comment-13860142 ] Markus Jelsma commented on NUTCH-1360: -- {code} --- conf/nutch-default.xml

[jira] [Updated] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-356: Attachment: NUTCH-356-trunk.patch Updated patch for trunk. All tests pass. According to

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860153#comment-13860153 ] Markus Jelsma commented on NUTCH-1360: -- Committed revision 1554791. Suport the

[jira] [Resolved] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1360. -- Resolution: Fixed This issue is not in 2.x, just trunk. All tests pass again. Suport the

Build failed in Jenkins: Nutch-trunk #2472

2014-01-02 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2472/changes Changes: [markus] NUTCH-1360 fix entity in configuration -- [...truncated 6752 lines...] clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file =

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860222#comment-13860222 ] Hudson commented on NUTCH-1360: --- FAILURE: Integrated in Nutch-trunk #2472 (See

[jira] [Created] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1691: Summary: DomainBlacklist url filter does not allow -D filter file override Key: NUTCH-1691 URL: https://issues.apache.org/jira/browse/NUTCH-1691 Project: Nutch

[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860281#comment-13860281 ] Markus Jelsma commented on NUTCH-1691: -- This means existing behaviour is unchanged,

[jira] [Updated] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1691: - Attachment: NUTCH-1691-trunk.patch Patch for trunk. This fixes the issue by defaulting it in

[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860288#comment-13860288 ] Markus Jelsma commented on NUTCH-1691: -- Well, there is a small issue now: {code} WARN

[jira] [Created] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-02 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1692: Summary: SegmentReader broken in distributed mode Key: NUTCH-1692 URL: https://issues.apache.org/jira/browse/NUTCH-1692 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1692: - Attachment: NUTCH-1692-trunk.patch Patch for trunk. Fix works, issue is gone. SegmentReader

[jira] [Commented] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860643#comment-13860643 ] Tejas Patil commented on NUTCH-1080: Committed to trunk (rev 1554881). Will port the

[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860678#comment-13860678 ] Tejas Patil commented on NUTCH-1691: Hi [~markus17], It's a good solution. +1 from me.

[jira] [Commented] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860739#comment-13860739 ] Hudson commented on NUTCH-1080: --- FAILURE: Integrated in Nutch-trunk #2473 (See

[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter

2014-01-02 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860740#comment-13860740 ] Hudson commented on NUTCH-1670: --- FAILURE: Integrated in Nutch-trunk #2473 (See

[jira] [Commented] (NUTCH-1454) parsing chm failed

2014-01-02 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860803#comment-13860803 ] Tejas Patil commented on NUTCH-1454: TIKA-1122 is fixed and I have verified that

Re: Nutch Crawl a Specific List Of URLs (150K)

2014-01-02 Thread Bin Wang
Thanks for all the responses; they are very inspiring, and diving into the log level is very helpful for learning Nutch. The fact is that I use Python BeautifulSoup to parse the sitemap of my targeted website, which comes up with those 150K URLs. However, it turned out that there are many many
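For readers following this thread, here is a minimal sketch of turning a sitemap into a URL list that could be used as a Nutch seed file (one URL per line). It is not the poster's script: it swaps BeautifulSoup for the standard-library `xml.etree.ElementTree`, and the function names `sitemap_urls` and `to_seed_text` are hypothetical. It assumes a plain `<urlset>` sitemap in the sitemaps.org 0.9 namespace, not a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the <loc> values of a <urlset> sitemap, in document order."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc/> elements instead of emitting blank seed lines.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


def to_seed_text(urls):
    """Format URLs as seed-file content: one URL per line."""
    return "".join(url + "\n" for url in urls)
```

The returned text can be written to a file inside the seed directory that is passed to the inject step.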

use Map Reduce + Jsoup to parse big Nutch/Content file

2014-01-02 Thread Bin Wang
Hi, I have a robot that scrapes a website daily and so far stores the HTML locally (in Nutch binary format in the segment/content folder). The size of the scraping is fairly big: a million pages per day. One thing about the HTML pages themselves is that they follow exactly the same format, so I can

[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2014-01-02 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142 ] Tien Nguyen Manh commented on NUTCH-1686: - In this patch I also fixed a bug with

Build failed in Jenkins: Nutch-trunk #2474

2014-01-02 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2474/ -- [...truncated 6749 lines...] deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile:

[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Issue Type: New Feature (was: Bug) TextMD5Signatue compute on textual content

[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Fix Version/s: 2.3 TextMD5Signatue compute on textual content

[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861195#comment-13861195 ] Tien Nguyen Manh commented on NUTCH-1693: - This patch only works with a minor

Re: use Map Reduce + Jsoup to parse big Nutch/Content file

2014-01-02 Thread Tejas Patil
Here is what I would do: If you are running a crawl, let it run with the default parser. Write a Nutch plugin with your customized parse implementation to evaluate your parse logic. Now get some real segments (with a subset of those million pages) and run only the 'bin/nutch parse' command and see how

Re: How Map Reduce code in Nutch run in local mode vs distributed mode?

2014-01-02 Thread Tejas Patil
The config 'fs.default.name' of core-site.xml is what makes this happen. Its default value is file:/// which corresponds to local mode of Hadoop. In local mode Hadoop looks for paths on the local file system. In distributed mode of Hadoop, 'fs.default.name' would be hdfs://IP_OF_NAMENODE/ and it
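The rule described in this reply can be illustrated with a small sketch. It is not Hadoop code: it only reads a core-site.xml document and classifies the value of 'fs.default.name' by its URI scheme, treating `file` as local mode and anything else as distributed. The helper names `read_default_fs` and `hadoop_mode` are hypothetical, and falling back to `file:///` when the property is absent follows the default stated in the message.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Default quoted in the message above; used when the property is not set.
DEFAULT_FS = "file:///"


def read_default_fs(core_site_xml):
    """Return the fs.default.name value from core-site.xml content."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "fs.default.name":
            value = prop.findtext("value", DEFAULT_FS).strip()
            return value or DEFAULT_FS
    return DEFAULT_FS


def hadoop_mode(default_fs):
    """Classify a filesystem URI: 'file' scheme means local mode."""
    return "local" if urlparse(default_fs).scheme == "file" else "distributed"
```

For example, a core-site.xml that sets the property to `hdfs://IP_OF_NAMENODE/` is classified as distributed, while an empty `<configuration/>` falls back to the local default.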

[jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-02 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861217#comment-13861217 ] Tejas Patil commented on NUTCH-356: --- +1 for commit. Plugin repository cache can lead to