You may want to look at Bixo (openbixo.org), which is a web crawler
built on Hadoop.
There is a small extension to it that parses pages into plain text using
boilerpipe (which removes boilerplate text) and Tika. The crawler
takes a list of URLs and filters them with regexes (in or out). It
then writes the text to Hadoop sequence files, which can be fed directly
into Mahout for vectorization, the first step for most of Mahout's
analysis features, including clustering and classification.
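To make the in/out filtering concrete, here is a minimal sketch in Python of how include/exclude regexes might be applied to a URL list. This is purely illustrative; the function and parameter names are mine, not Bixo's actual API:

```python
import re

def filter_urls(urls, include_patterns, exclude_patterns):
    """Keep URLs that match at least one include pattern (if any are
    given) and no exclude pattern. A hypothetical sketch, not Bixo's API."""
    include = [re.compile(p) for p in include_patterns]
    exclude = [re.compile(p) for p in exclude_patterns]
    kept = []
    for url in urls:
        # "In" filter: if include patterns exist, at least one must match.
        if include and not any(r.search(url) for r in include):
            continue
        # "Out" filter: any matching exclude pattern drops the URL.
        if any(r.search(url) for r in exclude):
            continue
        kept.append(url)
    return kept

urls = [
    "http://example.com/news/tech-story.html",
    "http://example.com/sports/game.html",
    "http://other.org/news/industry-report.html",
]
print(filter_urls(urls,
                  include_patterns=[r"/news/"],
                  exclude_patterns=[r"other\.org"]))
# keeps only the example.com news URL
```

The same idea applies whether the filtering happens before the crawl (on the seed list) or after (on fetched pages).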
I forked the code to add the Hadoop output, boilerpipe, and filtering,
all of which are driven from the command line:
https://github.com/pferrel/bixo
The project also includes a couple of tools for unrelated tasks; just ignore them.
Read Mahout in Action. It will make your next two weeks more productive.
On 7/30/12 5:59 AM, David Rose wrote:
Clustering and classification are what we want to use. We want to
grab news about specific industries or companies from sites we specify, and
then have the articles classified by relevance.
On Jul 30, 2012, at 8:26 AM, Sean Owen wrote:
Extract as in web crawl? No, it has nothing to do with that.
Extract as in entity extraction? I don't think there are relevant
implementations here either, though that begins to border on machine
learning.
This is more about clustering and classification of documents than anything
else.
On Mon, Jul 30, 2012 at 1:22 PM, David Rose <[email protected]> wrote:
Hi all,
I apologize for how basic my question is, but I am very new to all of
this, machine learning, writing code, all of it. I was finally able to get
Mahout downloaded, installed, and running. I was assigned a project at my
work to try to use Mahout to extract data from websites that we input. Is
this possible? Can anyone help me with suggestions or instructions on how
to do so? I appreciate any help on this, as I have only two more weeks to
finish this project.
Thanks,
David Rose