Re: Design Question

Ted Dunning Sun, 31 Oct 2010 13:30:51 -0700

There is no direct support for this in Mahout, but some of the underpinnings
are there.  One thought that I have is that the data involved in these
processes are not usually massive and can be handled using conventional
systems.  The reason that the scale isn't so large is that you either have a
very low event rate, which means that the total number of events is small, or
you have a high event rate in which the underlying Poisson parameters vary
quite slowly relative to the inter-arrival time.  This means that you can
measure counts over time periods that are still pretty short with respect to
the parameter rate of change and have small data again.


Given this, my suggestions are to do one or more of the following:

- use JAGS in R or BUGS for doing the hierarchical Bayesian modeling
described in this paper

- use raw R to build an MCMC sampler for this model

- experiment with variational optimization of this model

- consider simplifying the MMPP model by directly estimating the output of
the Markov model using something like a Kalman filter and short time
averages for rate parameters.  This gives an incredibly simple model with
very good performance.  For instance, I have done this to create a system
to alert when sales on a web site stopped happening.  The method I used was
to use hourly estimates of rates and build a linear model based on the rate
for the same hour one week ago and one day ago.  Then, I could build a
Poisson process alert based simply on inter-arrival time and desired false
positive rate.  Normally I set the false positive rate to about one alarm
per week or two.  This worked extremely well.

- when you need to deploy the system and know specifically what you want to
do, come back to Mahout to code the system
using the basic numerical mathematical algorithms that you have developed in
the first three options.

The reason that I suggest this is that Mahout is not a super efficient
experimental platform because for experimental purposes,
efficiency is measured in developer time, not run time.  Mahout does provide
good deployment efficiency because it supports scaling
well, but this comes at a developer time cost.
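
The simplified approach in the fourth suggestion above can be sketched in a
few lines of Python.  This is only a rough sketch under stated assumptions,
not the code that was actually used: the function names (`fit_rate_model`,
`predict_rate`, `gap_threshold_hours`), the no-intercept least-squares fit,
and the threshold formula are all illustrative choices.  The threshold assumes
the rate is roughly constant within the hour, so that a gap longer than t
occurs with probability exp(-rate * t), and it picks t so that the expected
number of such gaps per week matches the desired false alarm budget.

```python
import math

HOURS_PER_DAY = 24
HOURS_PER_WEEK = 168


def fit_rate_model(hourly_counts):
    """Least-squares fit of count[t] ~ a * count[t-24] + b * count[t-168].

    Assumption: no intercept term; solved via the 2x2 normal equations.
    """
    s_dd = s_dw = s_ww = s_dy = s_wy = 0.0
    for t in range(HOURS_PER_WEEK, len(hourly_counts)):
        d = hourly_counts[t - HOURS_PER_DAY]    # same hour, one day ago
        w = hourly_counts[t - HOURS_PER_WEEK]   # same hour, one week ago
        y = hourly_counts[t]
        s_dd += d * d
        s_dw += d * w
        s_ww += w * w
        s_dy += d * y
        s_wy += w * y
    det = s_dd * s_ww - s_dw * s_dw
    if abs(det) < 1e-12:
        # Degenerate or missing history: fall back to a plain average.
        return 0.5, 0.5
    a = (s_dy * s_ww - s_wy * s_dw) / det
    b = (s_wy * s_dd - s_dy * s_dw) / det
    return a, b


def predict_rate(a, b, count_day_ago, count_week_ago):
    """Predicted events per hour for the current hour."""
    return a * count_day_ago + b * count_week_ago


def gap_threshold_hours(rate_per_hour, false_alarms_per_week=1.0):
    """Silence (in hours) after which to alert.

    Solves rate * 168 * exp(-rate * t) = false_alarms_per_week for t.
    Returns 0.0 when events are rarer than the false alarm budget itself,
    in which case this simple rule is not informative.
    """
    if rate_per_hour <= 0:
        raise ValueError("rate must be positive")
    gaps_per_week = rate_per_hour * HOURS_PER_WEEK
    return max(0.0, math.log(gaps_per_week / false_alarms_per_week) / rate_per_hour)
```

Under these assumptions, a predicted rate of 100 sales per hour and a budget
of one false alarm per week gives a threshold of roughly six minutes of
silence.  An alert fires as soon as the time since the last transaction
exceeds the threshold for the current hour.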

Speak up if my suggestions are silly.  You certainly know your problem
better than I do.

And while you are at it, can you say what your data represent?  Can you
publish your data?


On Sun, Oct 31, 2010 at 1:17 AM, Mubarak Seyed <[email protected]> wrote:

> Thanks Ted.
>
> Is there any way to use MMPP (Markov-modulated Poisson process) algorithm
> (www.datalab.uci.edu/papers/tkdd07.pdf) in Mahout 0.4?
> Can you please direct me to some examples?
>
> Thanks,
> Mubarak
>
>
> On Wed, Oct 20, 2010 at 4:06 PM, Ted Dunning <[email protected]>
> wrote:
>
> > For many situations, this can be done very simply, especially if you are
> > working with web-based systems.  For that case, it is straightforward to
> > model transactions coming as a Poisson process with a time-varying rate.
> > In the simplest case, very simple seasonality models can be used to
> > estimate the time-varying rate.  I have used hourly estimates from one day
> > ago and one week ago as good indicators in the past.  These indicators did
> > not model long weekends as well as I would have liked, but the alarms based
> > on these models were better than any other system available.  Long-term
> > seasonality was handled very well because of the short-term nature of the
> > expected volume estimates.  For tighter bounds, it should be possible to
> > use something akin to generalized linear models to incorporate more
> > information to get better rate predictions.  Since the failures I was
> > trying to detect quickly were typically total failures, I just had to
> > raise an alert as quickly as possible when the inter-transaction time
> > exceeded a reasonable bound.  For a specified false positive rate, this
> > was very easily done and results were very nearly optimal.  More
> > importantly, the alerts almost always were faster than our CEO, who had an
> > eagle eye for these things.
> >
> > For brick-and-mortar systems, this can be a bit more difficult because
> > business practices tend to cause some very irregular
> > volumes.  If you are dealing with transactions that are being reported in
> > real-time rather than in batches, then you should be
> > fine.  Batch reporting based on human triggers could probably be handled
> > using longer/softer rate averaging windows, however.
> >
> > I really don't expect that you need anything all that fancy for the rate
> > estimation.
> >
> > Can you say more about your data?  Can you post anonymous sample data for
> > a two-week period?
> >
> > On Tue, Oct 19, 2010 at 11:26 PM, Mubarak Seyed <[email protected]
> > >wrote:
> >
> > > My requirements are as follows:
> > >
> > > - The client system does its transactions through a hub.  We have
> > > historical data, and we can predict the trends of the min/avg/max number
> > > of transactions for a given interval
> > > - Using the historical data, mine the data to find the predictions
> > > - Need to build an intelligent system (using ML techniques, neural network
> > > algorithms): if there is no transaction for a client in the given
> > > prediction range, then the system needs to send alarms
> > >
> > >
> > > For example, Walmart sells gift cards; each sale is a transaction and it
> > > needs to come to the main system (from the hub).  We have historical data
> > > for Walmart sales (for each day, each hour, each 10 mins, peak volume,
> > > holiday season).  If there is no transaction from Walmart for X range of
> > > time and that range does not fall in the prediction data, then the
> > > intelligent system needs to raise an alarm.
> > >
> >
>
>
>
> --
> Thanks,
> Mubarak Seyed.
>
