The easiest web crawler I know of is 'wget'.
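For instance, a minimal polite crawl with wget might look like this (the URL, depth, and output directory below are placeholders, not anything specific to your project):

```shell
# Mirror a site two levels deep, waiting 1s between requests,
# keeping only HTML pages for later parsing, and never climbing
# above the starting directory.
wget --recursive --level=2 \
     --wait=1 \
     --accept html,htm \
     --no-parent \
     --directory-prefix=crawl \
     http://example.com/
```

The downloaded pages land under crawl/, where you can feed them to whatever parsing or classification step comes next.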

On Mon, Jul 30, 2012 at 7:31 AM, David Rose <[email protected]> wrote:
> Is there a way to combine both Apache Nutch and Mahout in order to do what I 
> am trying to do?
> On Jul 30, 2012, at 8:29 AM, Xavier Rampino wrote:
>
>> If you want to develop scrapers, I suggest you take a look at jsoup (
>> http://jsoup.org/), which allows you to parse HTML easily. If you need
>> subsequent classification of the websites, then maybe you'll need Mahout.
>>
>> On Mon, Jul 30, 2012 at 2:26 PM, Sean Owen <[email protected]> wrote:
>>
>>> Extract as in web crawl? No, it's nothing to do with that.
>>> Extract as in entity extraction? I don't think there are relevant
>>> implementations here either, though that begins to border on machine
>>> learning.
>>> This is more about clustering and classification of documents than anything
>>> else.
>>>
>>> On Mon, Jul 30, 2012 at 1:22 PM, David Rose <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I apologize for how basic my question is, but I am very new to all of
>>>> this: machine learning, writing code, all of it. I was finally able to
>>>> get Mahout downloaded, installed, and running. I was assigned a project
>>>> at my work to try to use Mahout to extract data from websites that we
>>>> input. Is this possible? Can anyone help me with suggestions or
>>>> instructions on how to do so? I appreciate any help on this, as I have
>>>> only two more weeks to finish this project.
>>>>
>>>> Thanks,
>>>>
>>>> David Rose
>>>
>

-- 
Lance Norskog
[email protected]
