There are a few problems with integrating with Lucene and Solr:
1) There are different Lucene index file formats. The Mahout tools have a
version of Lucene bound to them, and the workflows assume that generated
index files will be thrown away. Different Solr versions use different
Lucene index formats, and also add new field types to the indexes.
2) Solr and Lucene are not distributed databases. Hadoop likes data files that
are available to many machines at the same time. There are projects to embed
Lucene indexes into distributed databases, but there are no "official"
releases.
3) Only a small part of the index holds the text information that you
actually want to use in a Mahout workflow, and that information is spread
around the index. This means a lot of file I/O, which is the bane of Hadoop
projects.

You are best off pulling what you want to analyze out of Solr into
sequence files of some sort. This would require your own program. Failing
that, recent Solr has a CSV output that you could spread among different
files.
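
As a rough sketch of the CSV route (not a tested recipe): the Python below
builds a select URL that asks Solr for CSV output via wt=csv, and splits
the returned CSV text round-robin into several pieces that could each be
written to its own file. The host, field names, and the helper names
solr_csv_url and shard_csv are made up for illustration; only the q, fl,
rows and wt request parameters come from Solr itself.

```python
import csv
import io
from urllib.parse import urlencode


def solr_csv_url(base_url, query, fields, rows):
    """Build a Solr select URL that requests CSV output (wt=csv).

    base_url is an assumed Solr location such as http://localhost:8983/solr.
    """
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "csv"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def shard_csv(csv_text, num_shards):
    """Split CSV text into num_shards CSV strings, round-robin by row.

    The header row is repeated at the top of every shard so that each
    piece can be read on its own.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    buffers = [io.StringIO() for _ in range(num_shards)]
    writers = [csv.writer(buf, lineterminator="\n") for buf in buffers]
    for writer in writers:
        writer.writerow(header)
    for i, row in enumerate(reader):
        writers[i % num_shards].writerow(row)
    return [buf.getvalue() for buf in buffers]
```

Fetching the URL and copying the resulting files into HDFS are left out
here, since they depend on your setup.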

On Mon, Aug 29, 2011 at 12:08 AM, Ramo Karahasan <
[email protected]> wrote:

> Hello Lance,
>
> So that means that currently no pure Solr index is supported by Mahout?
>
> Thanks,
> RK
>
> -----Original Message-----
> From: Lance Norskog [mailto:[email protected]]
> Sent: Monday, August 29, 2011 07:51
> To: [email protected]
> Subject: Re: Workflow for categorization/classifiying
>
> If you create Lucene indexes in Solr which match the Lucene index formats
> used in the Mahout code, that is easiest.
>
> Otherwise, I would make a file input reader for Hadoop based on the SolrJ
> library. This would include a way to configure the actual query and response
> fields and how they map to the mapper inputs. If it reads too slowly, you
> can change the SolrJ library to the Embedded Solr app (which reads directly
> from the indexes instead of using a servlet container).
>
> On Sun, Aug 28, 2011 at 11:28 AM, Ramo Karahasan <
> [email protected]> wrote:
>
> > Thank you Ted,
> >
> > I'll have a look at the example in the next few days.
> >
> > I guess I'll get a copy of the ebook; the shipping costs of the
> > printed version are very high...
> >
> > Thanks,
> > RK
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:[email protected]]
> > Sent: Sunday, August 28, 2011 19:57
> > To: [email protected]
> > Subject: Re: Workflow for categorization/classifiying
> >
> > See https://github.com/tdunning/Chapter-16 for example code.
> >
> > The book has a lot of background material on why things are as they
> > appear in the example but you should be able to get some benefit from
> > the example anyway.
> >
> >
> > On Sunday, August 28, 2011, Ramo Karahasan
> > <[email protected]>
> > wrote:
> > > Hi,
> > >
> > > I'm primarily not looking for the right algorithms, but rather for a
> > > way to implement this in a web application that processes the workflow
> > > "on-the-fly".
> > >
> > > Thanks,
> > > Ramo
> > >
> > > -----Original Message-----
> > > From: myn [mailto:[email protected]]
> > > Sent: Sunday, August 28, 2011 18:33
> > > To: [email protected]
> > > Subject: Re:AW: Workflow for categorization/classifiying
> > >
> > > I think the Bayes or DecisionForest classification methods would be
> > > suitable. Look at the following links:
> > >
> > > https://cwiki.apache.org/confluence/display/MAHOUT/Bayesian
> > > https://cwiki.apache.org/confluence/display/MAHOUT/Random+Forests
> > > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression
> > >
> > >
> > >
> > >
> > > At 2011-08-28 23:59:03,"Ramo Karahasan"
> > > <[email protected]>
> > > wrote:
> > >>Hi Ted,
> > >>
> > >>No, I have not looked at the book. I'd have to buy a copy, but there is
> > >>the money problem ;).
> > >>What I want to do is quite simple... I have some manually chosen
> > >>categories (topics) and a lot of documents on a webpage. These
> > >>documents aren't well categorized/classified into the right topics.
> > >>The documents all reside in a Solr index. The aim was to try
> > >>auto-categorization with Mahout: train the system incrementally
> > >>every time new documents arrive, and update the categories on the
> > >>website.
> > >>
> > >>Thanks,
> > >>RK
> > >>
> > >>-----Original Message-----
> > >>From: Ted Dunning [mailto:[email protected]]
> > >>Sent: Saturday, August 27, 2011 17:13
> > >>To: [email protected]
> > >>Subject: Re: Workflow for categorization/classifiying
> > >>
> > >>Yes.  That is a reasonable work-flow.  Have you looked at the book
> > >>Mahout in Action (conflict alert, I am an author)?  We provide
> > >>extensive details on how you can use categorization and clustering
> > >>on real problems in the last two sections of the book.
> > >>
> > >>Also, if you say just a bit more about what you want to do, it would
> > >>be easier to help you.
> > >>
> > >>On Sat, Aug 27, 2011 at 6:01 AM, Ramo Karahasan <
> > > [email protected]> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I wanted to ask if there is a common workflow when trying to
> > >>> categorize/classify documents with Mahout. For me, one possible
> > >>> workflow with Solr could be:
> > >>>
> > >>> index documents into solr -> fetch data from solr -> prepare data
> > >>> for training -> operate training -> get data model -> operate with
> > >>> algorithms on data model -> get a result list -> ?
> > >>>
> > >>> Is that a possible workflow with Mahout, and what should I do after
> > >>> getting the processed categorizations? How would I make use of this
> > >>> result?
> > >>>
> > >>> Thanks,
> > >>> RK
> > >>>
> > >>> -----Original Message-----
> > >>> From: Sean Owen [mailto:[email protected]]
> > >>> Sent: Saturday, August 27, 2011 09:40
> > >>> To: [email protected]
> > >>> Subject: Re: How to get recommendation demo example working
> > >>>
> > >>> No there is not.
> > >>>
> > >>> On Sat, Aug 27, 2011 at 8:33 AM, Ramo Karahasan <
> > >>> [email protected]> wrote:
> > >>>
> > >>> > Thank you Sean,
> > >>> >
> > >>> > I'll try that today.
> > >>> >
> > >>> > Is there a similar example for classification with a
> > >>> > web application?
> > >>> >
> > >>> >
> > >>>
> > >>>
> > >>
> > >
> > >
> >
> >
>
>
> --
> Lance Norskog
> [email protected]
>
>


-- 
Lance Norskog
[email protected]
