from:"Bin Wang"

[jira] [Commented] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang commented on NUTCH-1755: - [~lewismc], right now, the project name

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:11 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:10 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:12 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:11 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:16 AM

[jira] [Comment Edited] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072736#comment-14072736 ] bin wang edited comment on NUTCH-1755 at 7/24/14 3:14 AM

[jira] [Commented] (NUTCH-1755) Project name bug in build.xml

2014-07-23 Thread bin wang (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072796#comment-14072796 ] bin wang commented on NUTCH-1755: - [~lewismc] Yes. I am looking at the trunk. Then should

Re: Independent Map Reduce to parse Nutch content (Cont.)

2014-01-04 Thread Bin Wang

understanding, I can see Nutch constantly uses Hadoop API without hadoop pre-installed.. why can't my code work.. Well, any hint or directional guidance will be appreciated, many thanks! /usr/bin On Sat, Jan 4, 2014 at 12:38 AM, Tejas Patil tejas.patil...@gmail.com wrote: Hi Bin Wang, I would

Independent Map Reduce to parse Nutch content (Cont.)

2014-01-03 Thread Bin Wang

Hi, I tried to modify the code here to parse the nutch content data... http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup And in the end of this email is a prototype that I have written to run map reduce to calculate the HTML content length of

Re: Nutch Crawl a Specific List Of URLs (150K)

2014-01-02 Thread Bin Wang

Thanks for all the response, they are very inspiring and diving into the log level is very beneficial to learn Nutch. The fact is that I use Python BeautifulSoup to parse the sitemap of my targeted website, which comes up with those 150K URLs, however, it turned out that there are many many

use Map Reduce + Jsoup to parse big Nutch/Content file

2014-01-02 Thread Bin Wang

Hi, I have a robot that scrapes a website daily and store the HTML locally so far(in nutch binary format in segment/content folder). The size of the scraping is fairly big. Million pages per day. One thing about the HTML pages themselves is that they follow exactly the same format.. so I can

Nutch Crawl a Specific List Of URLs (150K)

2013-12-27 Thread Bin Wang

Hi, I have a very specific list of URLs, which is about 140K URLs. I switch off the `db.update.additions.allowed` so it will not update the crawldb... and I was assuming I can feed all the URLs to Nutch, and after one round of fetching, it will finish and leave all the raw HTML files in the

Step Through Nutch 1.7 Inside Eclipse Missing Argument

2013-12-22 Thread Bin Wang

Hi there, I was following the RunNutchInEclipse tutorial (1.7 Nutch / trunk example). After I configured the java run configurations as the tutorial showed.. and clicked run. It did not show the injector process as shown in the tutorial, and instead, it showed error: Usage: Injector

14 matches
