The easiest web crawler I know of is 'wget'.

On Mon, Jul 30, 2012 at 7:31 AM, David Rose <[email protected]> wrote:
> Is there a way to combine both Apache Nutch and Mahout in order to do
> what I am trying to do?
>
> On Jul 30, 2012, at 8:29 AM, Xavier Rampino wrote:
>
>> If you want to develop scrapers, I suggest you take a look at jsoup
>> (http://jsoup.org/), which allows you to parse HTML easily. If you need
>> subsequent classification of the websites, then maybe you'll need Mahout.
>>
>> On Mon, Jul 30, 2012 at 2:26 PM, Sean Owen <[email protected]> wrote:
>>
>>> Extract as in web crawl? No, it's nothing to do with that.
>>> Extract as in entity extraction? I don't think there are relevant
>>> implementations here either, though that begins to border on machine
>>> learning.
>>> This is more about clustering and classification of documents than
>>> anything else.
>>>
>>> On Mon, Jul 30, 2012 at 1:22 PM, David Rose <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I apologize for how basic my question is, but I am very new to all of
>>>> this: machine learning, writing code, all of it. I was finally able
>>>> to get Mahout downloaded, installed, and running. I was assigned a
>>>> project at my work to try to use Mahout to extract data from websites
>>>> that we input. Is this possible? Can anyone help me with suggestions
>>>> or instructions on how to do so? I appreciate any help on this, as I
>>>> have only two more weeks to finish this project.
>>>>
>>>> Thanks,
>>>>
>>>> David Rose
--
Lance Norskog
[email protected]
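To illustrate the link-extraction step Xavier's jsoup suggestion handles: below is a minimal, JDK-only sketch of pulling hrefs out of an HTML string with a regex. This is a toy illustration only (the `LinkExtractor` class and its method are made up for this example, and a regex is far more fragile than jsoup's real HTML parser), but it shows the kind of parsing jsoup does for you:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy link extractor. jsoup does this robustly with a real HTML parser
// (Jsoup.parse(html).select("a[href]")); a regex is only for illustration.
public class LinkExtractor {
    private static final Pattern HREF =
        Pattern.compile("href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns every href attribute value found in the given HTML, in order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://jsoup.org/\">jsoup</a>"
                    + " and <a href='http://mahout.apache.org/'>Mahout</a></p>";
        // Prints the two URLs found in the snippet above.
        System.out.println(extractLinks(html));
    }
}
```

Once pages are fetched (by wget or Nutch) and reduced to text and links like this, Mahout's role would begin: vectorizing the documents and clustering or classifying them.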
