It is quite possible. If the new columns represent a relatively small contribution rather than a wholesale change in the statistics of the corpus (which is almost always true) then you can just add these columns and compute IDF weights for the new terms based on the updated corpus statistics. You don't need to update the old IDF weights because the number of documents isn't going to change a lot and the old terms probably occur in the new documents at about the same rate anyway.
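As a rough illustration of the idea above, here is a minimal Python sketch that computes IDF weights only for terms not seen before and leaves the existing weights untouched. The toy documents, the helper names (`idf`, `doc_freqs`) and the smoothed IDF formula are assumptions made for this example, not something prescribed by any particular library.

```python
import math

def idf(doc_freq, n_docs):
    # Smoothed IDF; the exact formula is an assumption for this sketch,
    # any common IDF variant would be used the same way.
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0

def doc_freqs(docs):
    # Count, for each term, the number of documents that contain it.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return df

# Hypothetical toy corpus: the original documents and a later batch.
old_docs = [["apple", "banana"], ["apple", "cherry"], ["banana", "cherry", "apple"]]
new_docs = [["apple", "durian"], ["durian", "banana"]]

# IDF weights computed once from the original corpus.
old_df = doc_freqs(old_docs)
idf_weights = {term: idf(count, len(old_docs)) for term, count in old_df.items()}
frozen = dict(idf_weights)  # copy kept only to show the old weights do not move

# After the new documents arrive, use the updated corpus statistics,
# but assign weights only to terms that have no weight yet.
all_docs = old_docs + new_docs
all_df = doc_freqs(all_docs)
for term, count in all_df.items():
    if term not in idf_weights:
        idf_weights[term] = idf(count, len(all_docs))
```

Here `durian` gets a weight from the five-document corpus while `apple`, `banana` and `cherry` keep the weights computed from the original three documents.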
Of course, you do have to go back through and add the zero columns to the old data. One work-around is to use really, really big vectors to start with and hope that nobody ever accidentally fills in one of these vectors. This is cool with sparse vectors since zeros aren't stored, so all of the unused columns have no impact. New vectors can have new columns, but old ones need no change since they effectively already have these columns.

A second possible work-around is to use the hashed encoding. This costs a bit more for encoding, but it gives you static vector sizes. For some algorithms, this is a huge win (SGD, for example, where we need to allocate a dense matrix).

On Fri, Jun 24, 2011 at 8:52 AM, Mark <[email protected]> wrote:

> Is it possible to add more dimensions to an existing TF-IDF vector? If so
> how would it be possible to determine what appropriate weighting to give to
> these new fields to make sure its not too much/too little?
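For reference, the hashed-encoding work-around mentioned in the reply could look roughly like the Python sketch below. It is a simplified single-hash version written for illustration, not the API of any particular library; `NUM_FEATURES`, `hashed_index` and `encode` are hypothetical names, and the vectors hold raw term counts only (IDF weighting is left out).

```python
import zlib
from collections import Counter

# Fixed vector size chosen up front; the value is an arbitrary assumption.
NUM_FEATURES = 2 ** 18

def hashed_index(term, num_features=NUM_FEATURES):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(term.encode("utf-8")) % num_features

def encode(tokens, num_features=NUM_FEATURES):
    # Sparse vector as {column index: count}; columns that are zero are not stored.
    vec = {}
    for term, count in Counter(tokens).items():
        idx = hashed_index(term, num_features)
        # Colliding terms simply add up in the same column.
        vec[idx] = vec.get(idx, 0.0) + count
    return vec

# A vector encoded before the term "durian" had ever been seen ...
old_vec = encode(["apple", "banana", "apple"])
# ... and one encoded afterwards; both live in the same fixed-size space.
new_vec = encode(["apple", "durian"])
```

Because every term maps into the same fixed range of columns, a term that shows up later needs no change to vectors that were encoded earlier. The price is that two terms can occasionally collide in one column.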
