Thanks for the insights, Ted.

On 9 Mar 2013, at 18:40, Ted Dunning <[email protected]> wrote:
> SVD techniques probably won't actually help that much given your current
> sparsity. There are two issues:
>
> First, your data is already quite small. SVD will only make it larger,
> because the average number of non-zero elements will increase dramatically.
>
> Second, given your sparsity, SVD will have very little to work with. Very
> sparse data elements are inherently nearly orthogonal.
>
> I think you need to find more features so that your average number of
> non-zeros goes up.
>
> On Sat, Mar 9, 2013 at 12:53 PM, Colum Foley <[email protected]> wrote:
>
>> Thanks a lot Ted. I think there's some preprocessing I can do to remove
>> some outliers, which may reduce my matrix size considerably. I'll also
>> check out some SVD techniques.
>> On 9 Mar 2013 17:16, "Ted Dunning" <[email protected]> wrote:
>>
>>> The new streaming k-means should be able to handle that data pretty
>>> efficiently. My guess is that on a single 16-core machine it should be
>>> able to complete the clustering in 10 minutes or so. That is
>>> extrapolation and thus could be wildly off, of course.
>>>
>>> You definitely mean sparse. 30 M / 20 M = 1.5 non-zero features per row.
>>> That may be a problem. Or it might make the clustering fairly trivial.
>>>
>>> Dan,
>>>
>>> That code isn't checked into trunk yet, I think. Can you comment on
>>> where working code can be found on GitHub?
>>>
>>> On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <[email protected]> wrote:
>>>
>>>> I have approximately 20 million items and a feature vector approximately
>>>> 30 million in length, very sparse.
>>>>
>>>> Would you have any suggestions for other clustering algorithms I should
>>>> look at?
>>>>
>>>> Thanks,
>>>> Colum
>>>>
>>>> On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> You are beginning to exit the realm of reasonable applicability for
>>>>> normal k-means algorithms here.
>>>>>
>>>>> How much data do you have?
>>>>>
>>>>> On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> When I run KMeans clustering on a cluster, I notice that when I have
>>>>>> "large" values for k (i.e. approx > 1000) I get loads of Hadoop write
>>>>>> errors:
>>>>>>
>>>>>> INFO hdfs.DFSClient: Exception in createBlockOutputStream
>>>>>> java.net.SocketTimeoutException: 69000 millis timeout while waiting
>>>>>> for channel to be ready for read. ch : java.nio.channels.SocketChannel
>>>>>>
>>>>>> This continues indefinitely, and lots of part-0xxxxx files of around
>>>>>> 30 KB are produced.
>>>>>>
>>>>>> If I reduce the value of k it runs fine. Furthermore, if I run it in
>>>>>> local mode with high values of k it runs fine.
>>>>>>
>>>>>> The command I am using is as follows:
>>>>>>
>>>>>> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
>>>>>> --clusters tmp -dm
>>>>>> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
>>>>>> 1.0 -x 20 -cl -k 10000
>>>>>>
>>>>>> I am running Mahout 0.7.
>>>>>>
>>>>>> Are there some performance parameters I need to tune for Mahout when
>>>>>> dealing with large volumes of data?
>>>>>>
>>>>>> Thanks,
>>>>>> Colum
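The 69000 ms figure in Colum's log appears to match the HDFS client's default 60 s read timeout plus a 3 s allowance per datanode in a three-node write pipeline. One commonly suggested workaround, assuming a Hadoop 1.x-era cluster like the one in this thread, is to raise the socket timeouts in hdfs-site.xml; the values below are illustrative, and the underlying cause may instead be the load from the many small part files a large k produces:

```xml
<!-- hdfs-site.xml: raise client read / datanode write socket timeouts.
     Values are in milliseconds and are illustrative, not recommendations. -->
<property>
  <name>dfs.socket.timeout</name>
  <value>180000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>180000</value>
</property>
```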
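Ted's point that very sparse vectors are inherently nearly orthogonal can be checked with a small simulation (a hypothetical Python sketch, not from the thread; the dimensions match Colum's numbers, with 2 non-zeros per row instead of the average 1.5):

```python
import random

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(b) < len(a):
        a, b = b, a  # iterate over the smaller dict
    return sum(v * b[i] for i, v in a.items() if i in b)

def random_sparse_vector(dim, nnz, rng):
    """A random sparse 0/1 vector with nnz non-zeros out of dim dimensions."""
    return {i: 1.0 for i in rng.sample(range(dim), nnz)}

rng = random.Random(42)
dim, nnz, trials = 30_000_000, 2, 10_000  # ~30M features, ~2 non-zeros per row

overlaps = 0
for _ in range(trials):
    a = random_sparse_vector(dim, nnz, rng)
    b = random_sparse_vector(dim, nnz, rng)
    if sparse_dot(a, b) > 0:
        overlaps += 1

# With nnz^2 / dim on the order of 1e-7, almost no random pair shares an
# index, so nearly every dot product (and cosine similarity) is exactly 0.
print(f"pairs with non-zero dot product: {overlaps} of {trials}")
```

At this sparsity a distance measure sees almost every pair of rows as equally far apart, which is why adding features (raising the average non-zero count) matters more than dimensionality reduction here.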
