Ah, didn't know that about Nutch files. I've only used the Nutch -> Solr integration. Does Pig make sequence files? Is there a Nutch->Pig integration?
On Tue, Jul 3, 2012 at 3:00 AM, Alexander Aristov <[email protected]> wrote:
> Hi Lance
>
> I understand that pages are pages, but Nutch stores pages in its own format
> while Mahout operates with other data formats.
>
> I would like to merge Nutch and Mahout with minimum effort, which is why I
> am asking what is easier: alter Mahout and implement logic to read/write
> Nutch data, or implement a Nutch plugin to invoke Mahout.
>
> How difficult is it to inject the Mahout engine into other Java programs?
> Will it be enough to add jar files, or does it require some configuration
> files and environment variables to be set?
>
> Best Regards
> Alexander Aristov
>
>
> On 3 July 2012 06:41, Lance Norskog <[email protected]> wrote:
>
>> Pages are pages. Mahout does not care where they came from. I guess
>> you want a parser for HTML pages.
>>
>> On Mon, Jul 2, 2012 at 12:11 PM, Alexander Aristov
>> <[email protected]> wrote:
>> > Forwarding this to the user list and the Mahout group.
>> >
>> > Like-minded people, any suggestions about integration? What shall I
>> > start with?
>> >
>> >
>> > Best Regards
>> > Alexander Aristov
>> >
>> >
>> > ---------- Forwarded message ----------
>> > From: Alexander Aristov <[email protected]>
>> > Date: 1 July 2012 23:02
>> > Subject: nutch and mahout integration
>> > To: [email protected]
>> >
>> >
>> > People
>> >
>> > Can you give me some advice?
>> >
>> > I want to integrate Nutch and Mahout to classify crawled pages.
>> >
>> > First question: Has someone tried this, and are there any libraries
>> > available?
>> >
>> > Next: What is better/easier? Improve Nutch and inject the Mahout
>> > classifier into the project, OR improve Mahout to add the ability to
>> > read and write Nutch files?
>> >
>> > Best Regards
>> > Alexander Aristov
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>

--
Lance Norskog
[email protected]
