Ioan- when you understand them, please explain them here:

https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats

On Mon, Nov 21, 2011 at 12:57 AM, Ioan Eugen Stan <[email protected]> wrote:

> I'll try in the next few days to track down the numbers from running the
>> stuff in my recent IBM article:
>> http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
>>  Or, you can go run them yourself!
>>
>
> I think posting some reference data for the jobs will be great. I will
> have something to compare to when I have something done. In the meantime I
> will try to get a quick and dirty implementation working and see how things
> move and post my findings. This could take a while as I depend on some
> modifications.
>
>
>  Otherwise, I don't know that we have any formula just yet.  I suspect
>> that once you reach a certain number of documents, your dictionary will
>> stop growing, more or less.  Then, it is just a question of how many
>> vectors you have and the sparseness.  This probably could be guessed at by
>> looking at what the average number of words is in your email collection.
>>  Naturally, attachments may skew this if you are including them.
>>
>
> I also suspect that things will level off asymptotically after a certain
> number of documents; it remains to be seen where that threshold is.
>
>
>  That has been my experience, too.  Seq2Sparse is often the long part.  I
>> suspect one could get it done a lot faster in Lucene.
>>  SequenceFilesFromDirectory is also slow, but that is inherently sequential.
>>
>
> I will be able to use a map reduce job to create vectors or just create
> them as an indexing step so I hope this step will not count when
> considering the effective clustering time.
>
>
>  I haven't explored yet what it would mean to use Encoded vectors in
>> Clustering, but perhaps I can call Ted to the front of the class and see if
>> he has thoughts on whether that even makes sense, as that would give you a
>> fixed size Vector.
>>
>> -Grant
>>
>
> I don't know about encoded vectors yet, I hope to get some more info on
> them from Mahout in Action. If they do what I think they do, I will
> definitely try them, and probably complain on the list (Ted) if I can't
> interpret them right :).
>
> Thanks for the reply,
>
> --
> Ioan Eugen Stan
>
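
On the "encoded vectors" point quoted above: as far as I understand it, the general idea is the hashing trick. Each token is hashed into one of a fixed number of slots, so the vector length does not depend on how large the dictionary grows. Below is a minimal Python sketch of that idea only. The `encode` helper and the size of 16 are made up for illustration, and this is not Mahout's encoder API.

```python
import hashlib

def encode(tokens, size=16):
    """Hypothetical helper: hash each token into one of `size` slots and count hits."""
    vec = [0.0] * size
    for tok in tokens:
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % size
        vec[idx] += 1.0  # colliding tokens simply share a slot
    return vec

short = encode("hello mahout".split())
long_ = encode("a much longer email body with many more distinct words".split())
print(len(short), len(long_))  # both vectors have the same fixed length
```

The trade-off is that distinct words can collide in the same slot, so whether this is acceptable for clustering is exactly the open question in the thread.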



-- 
Lance Norskog
[email protected]
