If you want to develop scrapers, I suggest you take a look at jsoup (http://jsoup.org/), which lets you parse HTML easily. If you then need to classify the websites you have scraped, that is where Mahout may be useful.
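
To give a rough idea of what parsing with jsoup looks like, here is a minimal sketch that pulls the link text and URLs out of a page. The class name LinkExtractor, the helper extractLinks, and the sample HTML are made up for illustration. It assumes the jsoup jar is on your classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {

    // Hypothetical helper: returns "link text -> href" for every <a href> in the HTML.
    static List<String> extractLinks(String html) {
        // Parse the HTML string into a DOM-like document.
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        // CSS-style selector: all anchor elements that carry an href attribute.
        for (Element a : doc.select("a[href]")) {
            links.add(a.text() + " -> " + a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input (an assumption for this sketch, not a real page).
        String html = "<html><head><title>Demo</title></head><body>"
                + "<p>See <a href=\"http://jsoup.org/\">jsoup</a> and "
                + "<a href=\"http://mahout.apache.org/\">Mahout</a>.</p>"
                + "</body></html>";

        System.out.println(Jsoup.parse(html).title());
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

For a live site you would fetch the page with Jsoup.connect(url).get() instead of parsing a string. The text you extract this way is what you would later turn into input for Mahout if you need classification or clustering.
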
On Mon, Jul 30, 2012 at 2:26 PM, Sean Owen <[email protected]> wrote:

> Extract as in web crawl? No it's nothing to do with that.
> Extract as in entity extraction? I don't think there are relevant
> implementations here either, though that begins to border on machine
> learning.
> This is more about clustering and classification of documents than anything
> else.
>
> On Mon, Jul 30, 2012 at 1:22 PM, David Rose <[email protected]> wrote:
>
> > Hi all,
> >
> > I apologize for how basic my question is, but I am very new to all of
> > this, machine learning, writing code, all of it. I was finally able to get
> > Mahout downloaded, installed, and running. I was assigned a project at my
> > work to try to use Mahout to extract data from websites that we input. Is
> > this possible? Can anyone help me with suggestions or instructions on how
> > to do so? I appreciate any help on this, as I have only two more weeks to
> > finish this project.
> >
> > Thanks,
> >
> > David Rose
