Re: Extracting data from websites

David Rose Mon, 30 Jul 2012 07:30:50 -0700

Is there a way to combine both Apache Nutch and Mahout in order to do what I am
trying to do?

On Jul 30, 2012, at 8:29 AM, Xavier Rampino wrote:


> If you want to develop scrapers, I suggest you take a look at jsoup (
> http://jsoup.org/), which allows you to parse HTML easily. If you need
> subsequent classification of the websites, then maybe you'll need Mahout.
> 
> On Mon, Jul 30, 2012 at 2:26 PM, Sean Owen <[email protected]> wrote:
> 
>> Extract as in web crawl? No, Mahout has nothing to do with that.
>> Extract as in entity extraction? I don't think there are relevant
>> implementations here either, though that begins to border on machine
>> learning.
>> This is more about clustering and classification of documents than anything
>> else.
>> 
>> On Mon, Jul 30, 2012 at 1:22 PM, David Rose <[email protected]> wrote:
>> 
>>> Hi all,
>>> 
>>> I apologize for how basic my question is, but I am very new to all of
>>> this, machine learning, writing code, all of it. I was finally able to
>>> get Mahout downloaded, installed, and running. I was assigned a project
>>> at my work to try to use Mahout to extract data from websites that we
>>> input. Is this possible? Can anyone help me with suggestions or
>>> instructions on how to do so? I appreciate any help on this, as I have
>>> only two more weeks to finish this project.
>>> 
>>> Thanks,
>>> 
>>> David Rose
>>
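
To make the HTML-parsing suggestion in the thread concrete, here is a minimal sketch of pulling links and visible text out of a page. It is not jsoup (a Java library) and not part of Mahout or Nutch. As a stand-in it uses Python's standard-library html.parser, so it runs without extra dependencies. The class name LinkAndTextExtractor, the helper extract, and the SAMPLE_HTML snippet are hypothetical names invented for this illustration.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collects link targets and visible text from an HTML document.

    Hypothetical helper for illustration; a stand-in for what a
    dedicated library such as jsoup would do on the Java side.
    """

    def __init__(self):
        super().__init__()
        self.links = []        # href values of <a> tags, in document order
        self.text_parts = []   # non-empty chunks of visible text
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is outside <script>/<style> and not blank.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (links, text) for the given HTML string."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Made-up sample page, used only to exercise the sketch.
SAMPLE_HTML = """<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><p>Hello <a href="http://jsoup.org/">jsoup</a> world.</p></body></html>"""

if __name__ == "__main__":
    links, text = extract(SAMPLE_HTML)
    print(links)
    print(text)
```

The extracted text is the kind of input that document clustering or classification would then consume. Fetching the pages in the first place (crawling) is a separate step that this sketch does not cover.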
