I agree with you that there is preparation needed for Mahout processing.

I was just trying to save on that effort by reusing the data already in Hive
instead of processing it twice.

I may have some more questions when I actually dive into the mining part.
(possibly a couple of months down the line).

Thanks for your input.

On Wed, Sep 1, 2010 at 12:58 AM, Sean Owen <[email protected]> wrote:

> Hive does something fairly unrelated to Mahout. It's an indexing and
> query system. Both might start from the same source data, but to do
> different things. There is no common format, no. Mahout generally
> operates on text files or "Vectors" in SequenceFiles. So there's some
> translation there at least.
>
> But I think a message here is that there's more preparation and
> thought necessary to start data mining. It's not like you point a data
> mining tool at some data and answers start flowing automatically.
> You'd have to be deliberately extracting and preparing data anyhow.
>
> On Tue, Aug 31, 2010 at 11:41 PM, hdev ml <[email protected]> wrote:
> > Thanks, Sean, for the answers. Thanks to Ted for the validation.
> >
> > Now my question is: since I want to do reporting on a large data set /
> > data warehouse, let's assume I choose Hive for that.
> >
> > Now, can Mahout integrate with Hive to use this data for learning, mining,
> > etc.? Or do I have to export the Hive data into text files hosted on
> > Hadoop/HDFS, which Mahout can then use for data mining?
> >
> > In short, can the data warehousing part be done by Hive and the data
> > mining part then be done by Mahout on this Hive data?
> >
> > -H
> >
> > On Tue, Aug 31, 2010 at 3:03 PM, Sean Owen <[email protected]> wrote:
> >
> >> On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <[email protected]> wrote:
> >> > Per my understanding of Hive, we can do some statistical reporting, like
> >> > frequency of user sessions, which geographical region, which device the
> >> > user is using the most, etc.
> >>
> >> Yes that's about what Hive is good for, if you're looking for some
> >> open-source libraries along those lines.
> >>
> >> >
> >> > But we also want to mine this data to get some predictive capabilities,
> >> > like what is the likelihood that the user will use the same device again.
> >> > Or, if we get sales/marketing data (on the roadmap for the future), we want
> >> > to possibly predict which region to put more marketing/sales efforts into.
> >> > What is the pattern for growth of the user base, and in which geographical
> >> > regions? What is the pattern of user requests failing? There are a number
> >> > of requirements like these from the business.
> >>
> >> This is pretty broad but I can try to give you the names of problems
> >> this sounds like, to guide your search.
> >>
> >> Predicting user usage of device sounds like a classification problem,
> >> like developing a probabilistic model of behavior.
> >>
> >> Deciding where to put marketing dollars sounds like a business
> >> problem, not machine learning. I don't think a computer can tell you
> >> that. Some techniques might help you identify trends in sales, but
> >> this is simple regression, not really machine learning.
> >>
> >> Looking for patterns in failure sounds a bit like frequent pattern
> >> mining -- trying to find events that go together unusually often.
> >>
> >
>
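
For anyone reading this thread later, the "translation" step Sean mentions
(getting Hive data into text files that Mahout can read) might look roughly
like the sketch below. It assumes the Hive table is stored as a plain text
table with Hive's default field delimiter (Ctrl-A, "\x01") and default NULL
marker ("\N"). The function name hive_rows_to_csv and the sample column
layout are hypothetical, chosen only for illustration. It is not a Mahout or
Hive API, and the exact input format a given Mahout job expects should be
checked separately.

```python
import csv
import io

# Assumption: the Hive table uses the default text storage settings.
HIVE_FIELD_DELIM = "\x01"  # Hive's default field delimiter (Ctrl-A)
HIVE_NULL = "\\N"          # Hive's default text representation of NULL


def hive_rows_to_csv(lines):
    """Convert Hive default-delimited text lines into comma-separated text.

    `lines` is any iterable of strings, e.g. an open file exported from the
    Hive warehouse directory. NULL markers become empty fields.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = [
            "" if field == HIVE_NULL else field
            for field in line.split(HIVE_FIELD_DELIM)
        ]
        writer.writerow(fields)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical rows: user id, device, session count.
    sample = ["u1\x01phone\x013\n", "u2\x01\\N\x011\n"]
    print(hive_rows_to_csv(sample), end="")
```

The output is ordinary text that can be copied to HDFS. Turning it into
"Vectors" in SequenceFiles would be a further, job-specific step that this
sketch does not cover.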
