Andy,

Thanks for popping up!

Elephant bird looks like it has awesome potential to make machine learning with Hadoop vastly easier. It is really good to see this kind of response ... that is what turns potential into action.

Thanks again.

On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <[email protected]> wrote:

> Hi Colum, I'm an ElephantBird project committer and wrote both
> SequenceFileStorage and the VectorWritableConverter.
>
> The default Writable type used by SequenceFileStorage for both key and
> value is Text, hence the Text data when you don't provide extra
> configuration.
>
> Could you provide some sample data or task attempt logs from your job to
> help diagnose the issue? Unit tests for both of these utils cover a lot of
> edge cases, but if you've found a new one I'd like to get it sorted out!
>
> Thanks,
> Andy
>
> On Fri, Mar 1, 2013 at 8:29 AM, Ted Dunning <[email protected]> wrote:
>
> > I haven't touched elephant bird in some time. I had some fits with it at
> > the time that I used it whenever I strayed from the well-trod path, but I
> > had heard it was much better lately.
> >
> > Sorry not to be much more help than that.
> >
> > On Fri, Mar 1, 2013 at 3:50 AM, Colum Foley <[email protected]> wrote:
> >
> > > I am trying to store Mahout RandomAccessSparseVector using
> > > elephant-bird and pig. The data is of the form
> > > key(text),value(RandomAccessSparseVector). when I run pig describe it
> > > presents the following:
> > >
> > > pair: {key: int,val: (cardinality: int,entries: {entry: (index:
> > > int,value: double)})}
> > >
> > > My problem is that when I try to store tuples using elephant-bird's
> > > SequenceFileStorage as follows:
> > >
> > > store clusteredOut into 'logsvectors.dat' using
> > > com.twitter.elephantbird.pig.store.SequenceFileStorage (
> > > '-c com.twitter.elephantbird.pig.util.TextConverter',
> > > '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter --
> > > -sparse'
> > > );
> > >
> > > It runs successfully but when I examine the resulting Sequencefile all
> > > the vectors are empty.
> > >
> > > On the other hand, if I run the following instead:
> > >
> > > store clusteredOut into 'logsvectors.dat' using
> > > com.twitter.elephantbird.pig.store.SequenceFileStorage ();
> > >
> > > ie do not specify the types of the key or value.
> > >
> > > The vectors are non-empty but are of type text..and this causes my
> > > clustering algorithm to fail(as they are expecting VectorWritable).
> > >
> > > So my problem is that I need to output in VectorFileFormat, but when I
> > > do the resulting vectors are empty.
> > >
> > > Anyone else have experience with this issue?
> > >
> > > Many thanks,
> > > Colum
