Hi Lance, Elephant Bird includes support for SequenceFile i/o from Pig:
https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java

It's available in Maven Central:
http://search.maven.org/#artifactdetails%7Ccom.twitter.elephantbird%7Celephant-bird-pig%7C3.0.1%7Cjar

<dependency>
  <groupId>com.twitter.elephantbird</groupId>
  <artifactId>elephant-bird-pig</artifactId>
  <version>3.0.1</version>
</dependency>

Andy
@sagemintblue

On Wed, Jul 4, 2012 at 5:33 PM, Lance Norskog <[email protected]> wrote:
> Ah, didn't know that about Nutch files. I've only used the Nutch ->
> Solr integration. Does Pig make sequence files? Is there a Nutch->Pig
> integration?
>
> On Tue, Jul 3, 2012 at 3:00 AM, Alexander Aristov
> <[email protected]> wrote:
> > Hi Lance
> >
> > I understand that pages are pages but nutch stores pages in its own format
> > while mahout operates with other data formats.
> >
> > I would like to merge nutch and mahout with minimum effort, that's why I
> > question what is easier. Alter mahout and implement logic to read/write
> > nutch data or implement nutch plugin to invoke mahout.
> >
> > How difficult is it to inject mahout engine into other java programs? Will it
> > be enough to add jar files or does it require some configuration files and
> > environment variables set?
> >
> > Best Regards
> > Alexander Aristov
> >
> >
> > On 3 July 2012 06:41, Lance Norskog <[email protected]> wrote:
> >
> >> Pages are pages. Mahout does not care where they came from. I guess
> >> you want a parser for HTML pages.
> >>
> >> On Mon, Jul 2, 2012 at 12:11 PM, Alexander Aristov
> >> <[email protected]> wrote:
> >> > Forward it to user list and mahout group.
> >> >
> >> > Like-minded, any suggestions about integration? What shall I start with?
> >> >
> >> >
> >> > Best Regards
> >> > Alexander Aristov
> >> >
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Alexander Aristov <[email protected]>
> >> > Date: 1 July 2012 23:02
> >> > Subject: nucth and mahout integration
> >> > To: [email protected]
> >> >
> >> >
> >> > People
> >> >
> >> > can you give me some advice?
> >> >
> >> > I want to integrate nutch and mahout to classify crawled pages.
> >> >
> >> > 1st question: Has someone tried this and are there any libraries available?
> >> >
> >> > next: What is better/easier? Improve nutch and inject mahout classifier
> >> > into the project OR improve mahout to add an ability to read and write
> >> > nutch files?
> >> >
> >> > Best Regards
> >> > Alexander Aristov
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> [email protected]
> >>
>
>
>
> --
> Lance Norskog
> [email protected]
>
