Re: Which database should I use with Mahout

2013-05-22 Thread Johannes Schulte
Yeah, that is what I had in mind as a simple solution. For examining bigger result sets I always fear the cost of loading a lot of stored fields, which is why I thought including it in the scoring might be cool. It's not possible with a plain Collector that maintains a priority queue of docid and s

Re: Hidden Markov Models and time series - 2 questions

2013-05-22 Thread yikes aroni
Thanks for the reply ... I've discretized the continuous time series observations and assigned them to symbols. The number of hidden states is 2: "out of control" and "not out of control" -- 0 and 1. With the scenario defined this way, I'm able to get good predictions from HMM. What I don't know how
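
A minimal sketch of the discretization step mentioned above, assuming equal-width bins over the observed range. The message does not say how the values were binned, so the binning rule, the class name Discretizer and the method discretize are all hypothetical and not part of Mahout.

```java
import java.util.Arrays;

public class Discretizer {
    /**
     * Maps each continuous observation to a symbol in [0, numSymbols)
     * using equal-width bins between the smallest and largest value.
     * Equal-width binning is an assumption made for illustration only.
     */
    public static int[] discretize(double[] values, int numSymbols) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double width = (max - min) / numSymbols;
        int[] symbols = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            if (width == 0.0) {
                symbols[i] = 0; // constant series: everything falls in one bin
                continue;
            }
            int bin = (int) ((values[i] - min) / width);
            // The maximum value would land in bin numSymbols, so clamp it.
            symbols[i] = Math.min(bin, numSymbols - 1);
        }
        return symbols;
    }

    public static void main(String[] args) {
        double[] series = {0.0, 2.5, 5.0, 7.5, 10.0};
        System.out.println(Arrays.toString(discretize(series, 4)));
    }
}
```

The resulting int array is the kind of symbol sequence an HMM trainer expects as its observations. Quantile-based bins or clustering short windows of the series are alternatives when the values are unevenly distributed.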

Re: Which database should I use with Mahout

2013-05-22 Thread Ted Dunning
Yes, what you are describing with diversification is something that I have called anti-flood. It comes from the fact that we really are optimizing a portfolio of recommendations rather than a batch of independent recommendations. Doing this from first principles is very hard but there are very s

how to run mahout examples from eclipse in linux?

2013-05-22 Thread qiaoresearcher
Hi all, Assume we want to run mahout examples like: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core--job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L It works well on the command line, b

Re: Which database should I use with Mahout

2013-05-22 Thread Pat Ferrel
This data was for a mobile shopping app. Other answers below. > On May 21, 2013, at 5:42 PM, Ted Dunning wrote: > > Inline > > > On Tue, May 21, 2013 at 8:59 AM, Pat Ferrel wrote: > >> In the interest of getting some empirical data out about various >> architectures: >> >> On Mon, May 20, 2

Re: Which database should I use with Mahout

2013-05-22 Thread Johannes Schulte
Okay, I got it! I have also always used a basic form of dithering, but we always called it shuffle since it basically was / is Collections.shuffle on a bigger list of results and therefore doesn't take the rank or score into account. Will try that. With diversification I really meant more something
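
To illustrate the difference described above, here is a hedged sketch that contrasts a plain Collections.shuffle over a pool of top results with one possible rank-aware variant. The rank-aware rule (sort by log(rank) plus Gaussian noise) is an assumption chosen for illustration; the thread excerpt does not specify a formula. ResultMixing, shuffleTop and dither are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class ResultMixing {
    /** Plain shuffle: picks k items from the top poolSize results, ignoring rank and score. */
    public static <T> List<T> shuffleTop(List<T> ranked, int poolSize, int k, Random rnd) {
        List<T> pool = new ArrayList<>(ranked.subList(0, Math.min(poolSize, ranked.size())));
        Collections.shuffle(pool, rnd);
        return new ArrayList<>(pool.subList(0, Math.min(k, pool.size())));
    }

    /**
     * Rank-aware reordering (assumed formulation): each item gets the key
     * log(rank) + sigma * gaussianNoise and the list is re-sorted by that key.
     * With sigma = 0 the original order is kept; larger sigma mixes more,
     * but items near the top are still more likely to stay near the top.
     */
    public static <T> List<T> dither(List<T> ranked, double sigma, Random rnd) {
        int n = ranked.size();
        double[] keys = new double[n];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            keys[i] = Math.log(i + 1) + sigma * rnd.nextGaussian();
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> keys[i]));
        List<T> result = new ArrayList<>(n);
        for (int idx : order) {
            result.add(ranked.get(idx));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("a", "b", "c", "d", "e", "f");
        Random rnd = new Random(42);
        System.out.println(shuffleTop(ranked, 4, 2, rnd));
        System.out.println(dither(ranked, 0.5, rnd));
    }
}
```

The shuffle treats every item in the pool as equally good, whereas the rank-aware version only perturbs the existing order, which is the property the message says the shuffle lacks.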

Re: mahout ssvd tuning problem

2013-05-22 Thread Sean Owen
I looked, and this job already uses a combiner called OuterProductCombiner. In fact it was right there in the stack trace, oops. At least, it shows this is happening in the mapper and the combiner is trying to do its job. I am still pretty sure both io.sort.* parameters are relevant here. Anyway

Re: mahout ssvd tuning problem

2013-05-22 Thread Dmitriy Lyubimov
I am actually not sure how to manipulate use of combiners in hadoop. All I can say is that the code does make extensive use of combiners but they were always "on" for me. I had no idea one might turn their use off. On Wed, May 22, 2013 at 6:17 AM, Jakub Pawłowski wrote: > Yes, I was manipulating io

Re: Hidden Markov Models and time series - 2 questions

2013-05-22 Thread Ted Dunning
HMM's could be useful, but you have to define things a bit differently. First of all, HMM's want symbolic inputs and want to give you symbolic outputs. You don't get to see the internal state. My first approach would be to use k-means clustering on short sequences of your observed continuous var

Hidden Markov Models and time series - 2 questions

2013-05-22 Thread yikes aroni
I'm not knowledgeable about statistics or data analysis, so please be gentle! I am using Mahout to predict the out-of-control state of a time series. I've had a fair amount of success classifying with SGD and Adaptive regression approaches but want to see if Hidden Markov Models can do a better job for my purp

Re: mahout ssvd tuning problem

2013-05-22 Thread Sean Owen
I mean you would have to write one and modify the code to use it. I don't know this job well enough to know whether it's possible or not though. At least, this is getting directly at reducing the amount of data spilled, rather than reducing the intermediate I/O needed to sort it. Doesn't io.sort.*

Re: mahout ssvd tuning problem

2013-05-22 Thread Jakub Pawłowski
Yes, I was manipulating io.sort.factor too. It speeds up the reducer; values around 30 give good results for me. But my problem is not the reducer, my problem is the Bt-job map tasks that spill to disk. You mentioned Combiner, how can I turn it on? I'm running my job from the console like this: mahout ssvd

Re: mahout ssvd tuning problem

2013-05-22 Thread Sean Owen
I feel like I've seen this too and it's just a bug. You're not running out of memory. Are you also setting io.sort.factor? That can help too. You might try as high as 100. Also, have you tried a Combiner? If you can apply it, it should help too, as it is designed to reduce the amount of stuff spille

Re: Feature vector generation from Bag-of-Words

2013-05-22 Thread Suneel Marthi
See inline. From: Stuti Awasthi To: "'user@mahout.apache.org'" Sent: Wednesday, May 22, 2013 7:02 AM Subject: RE: Feature vector generation from Bag-of-Words Hi Suneel, I implemented your suggested approach. This was simple to implement and you have mad

mahout ssvd tuning problem

2013-05-22 Thread Jakub Pawłowski
Hi, I'm trying to tune mahout ssvd job to not spill so much, I'm trying to tune io.sort.mb 1047 but when I try to put any bigger value, i.e. io.sort.mb 1247 according to hadoop source code this value can be as high as 2047 http://grepcode.com/file/repository.cloudera.com/content/re

Re: WELCOME to user@mahout.apache.org

2013-05-22 Thread Jakub Pawłowski
Hi, I'm trying to tune mahout ssvd job to not spill so much, I'm trying to tune io.sort.mb 1047 but when I try to put any bigger value, i.e. io.sort.mb 1247 according to hadoop source code this value can be as high as 2047 http://grepcode.com/file/repository.cloudera.com/content/re

RE: Feature vector generation from Bag-of-Words

2013-05-22 Thread Stuti Awasthi
Hi Suneel, I implemented your suggested approach. This was simple to implement and you have made the steps pretty clear. Thank you :). I have a few queries about creating features using Multiset: 1. Can't we treat keywords case-insensitively using Multiset, i.e. my keyword may be "Day" and in docum