You may want to look at Bixo (openbixo.org), which is a web crawler
built on Hadoop.
There is a small extension to it that parses pages into plain text using
boilerpipe (which removes boilerplate text) and Tika. The crawler
takes a list of URLs and filters them with regexes (in or out). It
then writes the text to Hadoop sequence files, which can be fed directly
into Mahout for vectorization, the first step for most of Mahout's
analysis features, including clustering and classification.
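To make the in/out filtering concrete, here is a minimal sketch in Python of how include/exclude regexes might be applied to a URL list. This is purely illustrative; the function and parameter names are mine, not Bixo's actual API:

```python
import re

def filter_urls(urls, include_patterns, exclude_patterns):
    """Keep URLs that match at least one include pattern (if any are
    given) and no exclude pattern. A hypothetical sketch, not Bixo's API."""
    include = [re.compile(p) for p in include_patterns]
    exclude = [re.compile(p) for p in exclude_patterns]
    kept = []
    for url in urls:
        # "In" filter: if include patterns exist, at least one must match.
        if include and not any(r.search(url) for r in include):
            continue
        # "Out" filter: any matching exclude pattern drops the URL.
        if any(r.search(url) for r in exclude):
            continue
        kept.append(url)
    return kept

urls = [
    "http://example.com/news/tech-story.html",
    "http://example.com/sports/game.html",
    "http://other.org/news/industry-report.html",
]
print(filter_urls(urls,
                  include_patterns=[r"/news/"],
                  exclude_patterns=[r"other\.org"]))
# keeps only the example.com news URL
```

The same idea applies whether the filtering happens before the crawl (on the seed list) or after (on fetched pages).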
I forked the code to add the Hadoop output, boilerpipe, and filtering,
all of which are driven from the command line:
https://github.com/pferrel/bixo
The project also includes a couple of tools for unrelated tasks; just ignore them.
Read Mahout in Action. It will make your next two weeks more productive.
On 7/30/12 5:59 AM, David Rose wrote:
Clustering and classification are what we want to use. We want to
grab news about specific industries or companies from sites we specify, and
then have the articles classified by relevance.
On Jul 30, 2012, at 8:26 AM, Sean Owen wrote:
Extract as in web crawl? No, it has nothing to do with that.
Extract as in entity extraction? I don't think there are relevant
implementations here either, though that begins to border on machine
learning.
This is more about clustering and classification of documents than anything
else.
On Mon, Jul 30, 2012 at 1:22 PM, David Rose <[email protected]> wrote:
Hi all,
I apologize for how basic my question is, but I am very new to all of
this, machine learning, writing code, all of it. I was finally able to get
Mahout downloaded, installed, and running. I was assigned a project at my
work to try to use Mahout to extract data from websites that we input. Is
this possible? Can anyone help me with suggestions or instructions on how
to do so? I appreciate any help on this, as I have only two more weeks to
finish this project.
Thanks,
David Rose