[jira] [Commented] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang commented on NUTCH-1755: [~lewismc], right now, the project name

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:11 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:10 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:12 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:16 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:14 AM

[jira] [Commented] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072796#comment-14072796 ] bin wang commented on NUTCH-1755: [~lewismc] Yes. I am looking at the trunk. Then should

Re: Independent Map Reduce to parse Nutch content (Cont.)

2014-01-04 Thread Bin Wang
understanding, I can see that Nutch constantly uses the Hadoop API without Hadoop pre-installed, so why can't my code work? Any hint or directional guidance will be appreciated, many thanks! /usr/bin On Sat, Jan 4, 2014 at 12:38 AM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Bin Wang, I would

Independent Map Reduce to parse Nutch content (Cont.)

2014-01-03 Thread Bin Wang
Hi, I tried to modify the code here to parse the Nutch content data: http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup At the end of this email is a prototype that I wrote to run MapReduce to calculate the HTML content length of

Re: Nutch Crawl a Specific List Of URLs (150K)

2014-01-02 Thread Bin Wang
Thanks for all the responses; they are very inspiring, and diving into the log level has been very helpful for learning Nutch. The fact is that I use Python BeautifulSoup to parse the sitemap of my targeted website, which comes up with those 150K URLs. However, it turned out that there are many many

use Map Reduce + Jsoup to parse big Nutch/Content file

2014-01-02 Thread Bin Wang
Hi, I have a robot that scrapes a website daily and stores the HTML locally (so far in Nutch binary format in the segment/content folder). The size of the scraping is fairly big: a million pages per day. One thing about the HTML pages themselves is that they follow exactly the same format, so I can

Nutch Crawl a Specific List Of URLs (150K)

2013-12-27 Thread Bin Wang
Hi, I have a very specific list of URLs, about 140K in total. I switched off `db.update.additions.allowed` so it will not update the crawldb, and I was assuming I could feed all the URLs to Nutch and, after one round of fetching, it would finish and leave all the raw HTML files in the

Step Through Nutch 1.7 Inside Eclipse Missing Argument

2013-12-22 Thread Bin Wang
Hi there, I was following the RunNutchInEclipse tutorial (1.7 Nutch / trunk example). After I configured the Java run configurations as the tutorial showed and clicked Run, it did not show the injector process as shown in the tutorial; instead, it showed an error: Usage: Injector