Is there a way to combine both Apache Nutch and Mahout in order to do what I am trying to do?

On Jul 30, 2012, at 8:29 AM, Xavier Rampino wrote:

> If you want to develop scrapers, I suggest you take a look at jsoup
> (http://jsoup.org/), which allows you to parse HTML easily. If you need
> subsequent classification of the websites, then maybe you'll need Mahout
>
> On Mon, Jul 30, 2012 at 2:26 PM, Sean Owen <[email protected]> wrote:
>
>> Extract as in web crawl? No it's nothing to do with that.
>> Extract as in entity extraction? I don't think there are relevant
>> implementations here either, though that begins to border on machine
>> learning.
>> This is more about clustering and classification of documents than
>> anything else.
>>
>> On Mon, Jul 30, 2012 at 1:22 PM, David Rose <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I apologize for how basic my question is, but I am very new to all of
>>> this, machine learning, writing code, all of it. I was finally able to
>>> get Mahout downloaded, installed, and running. I was assigned a project
>>> at my work to try to use Mahout to extract data from websites that we
>>> input. Is this possible? Can anyone help me with suggestions or
>>> instructions on how to do so? I appreciate any help on this, as I have
>>> only two more weeks to finish this project.
>>>
>>> Thanks,
>>>
>>> David Rose
