Re: GSoC 2009-Discussion

2009-03-24 Thread deneche abdelhakim

Talking about Random Forests, I think there are two possible ways to actually
implement them:

The first implementation is useful when the dataset is not that big (<= 2 GB,
perhaps) and can thus be distributed via Hadoop's DistributedCache. In this
case each mapper has access to the whole dataset and builds a subset of the
forest.
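
Just to make the idea concrete, here is a rough driver-side sketch of the first
approach, using the org.apache.hadoop.mapred API (the class name, cached path,
and property key below are only placeholders I made up for illustration, not
existing Mahout code):

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ForestDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ForestDriver.class);
    conf.setJobName("random-forest-in-memory");

    // ship the whole training set to every node; each mapper can then load it
    // locally and grow its own subset of the trees on bootstrap samples
    DistributedCache.addCacheFile(new URI("/data/trainingset.arff"), conf);

    // how many trees each mapper should grow (placeholder property name)
    conf.setInt("forest.trees.per.mapper", 10);
    // conf.setMapperClass(ForestMapper.class); // hypothetical mapper that builds the trees

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}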

The second one targets large datasets, and by large I mean datasets that
cannot fit on every computing node. In this case each mapper processes a
subset of the dataset for all the trees.
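
And one naive way to realize the second approach (again, every name below is a
placeholder, and a real implementation would have to be smarter than emitting
each record roughly once per tree, which multiplies the data by the number of
trees): each mapper reads only its own split and emits (treeId, record) pairs
according to a bootstrap sample, so that a downstream reducer can grow one tree
per key.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PartitionedForestMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private int numTrees;
  private Random rng;

  @Override
  public void configure(JobConf job) {
    numTrees = job.getInt("forest.num.trees", 100); // placeholder property name
    rng = new Random(job.getLong("forest.seed", 1L));
  }

  @Override
  public void map(LongWritable offset, Text record,
                  OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    // sampling with replacement: a record appears in each tree's bootstrap
    // sample zero or more times, roughly once on average
    for (int tree = 0; tree < numTrees; tree++) {
      int copies = poisson(1.0, rng);
      for (int c = 0; c < copies; c++) {
        output.collect(new IntWritable(tree), record);
      }
    }
  }

  // Knuth's method for drawing from a Poisson(lambda) distribution
  private static int poisson(double lambda, Random rng) {
    double l = Math.exp(-lambda);
    int k = 0;
    double p = 1.0;
    do {
      k++;
      p *= rng.nextDouble();
    } while (p > l);
    return k - 1;
  }
}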

I'm more interested in the second implementation, so maybe Samuel would be
interested in the first... but of course only if the community actually needs
them both :)

--- On Tue, 24.3.09, Ted Dunning  wrote:

> From: Ted Dunning 
> Subject: Re: GSoC 2009-Discussion
> To: mahout-dev@lucene.apache.org
> Date: Tuesday, 24 March 2009, 0:07
> There are other algorithms of serious interest.  Bayesian Additive
> Regression Trees (BART) would make a very interesting complement to Random
> Forests.  I don't know how important it is to get a normal decision tree
> algorithm going because the cost to build these is often not that high.
> Boosted decision trees might be of interest, but probably not as much as
> BART.
> 
> It might also be interesting to work with this student to implement some of
> the diagnostics associated with random forests.  There is plenty to do.
> 
> 
> - Original Message 
> 
> > From: Samuel Louvan 
> >
> > My questions:
> > - I just noticed in the mailing archive that another student is also
> >   quite serious about implementing the random forest algorithm. Should
> >   I choose decision trees instead? (for my future GSoC proposal)
> > - Actually, I think it would be interesting to combine Apache Nutch and
> >   Mahout: the idea is to implement web page segmentation + a classifier
> >   inside a web crawler. By doing this, a crawler, for instance, could
> >   use the output of the classification to follow only those links that
> >   lie in informative content parts.
> >   Does this sound interesting and make sense to you guys?
> >
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 





Re: GSoC 2009-Discussion

2009-03-23 Thread Ted Dunning
There are other algorithms of serious interest.  Bayesian Additive
Regression Trees (BART) would make a very interesting complement to Random
Forests.  I don't know how important it is to get a normal decision tree
algorithm going because the cost to build these is often not that high.
Boosted decision trees might be of interest, but probably not as much as
BART.

It might also be interesting to work with this student to implement some of
the diagnostics associated with random forests.  There is plenty to do.


- Original Message 

> From: Samuel Louvan 
>
> My questions:
> - I just noticed in the mailing archive that another student is also quite
>   serious about implementing the random forest algorithm. Should I choose
>   decision trees instead? (for my future GSoC proposal)
> - Actually, I think it would be interesting to combine Apache Nutch and
>   Mahout: the idea is to implement web page segmentation + a classifier
>   inside a web crawler. By doing this, a crawler, for instance, could use
>   the output of the classification to follow only those links that lie in
>   informative content parts.
>   Does this sound interesting and make sense to you guys?
>



-- 
Ted Dunning, CTO
DeepDyve


Re: GSoC 2009-Discussion

2009-03-23 Thread Otis Gospodnetic

Mmmm :)  This would definitely be very useful to anyone dealing with web 
page parsing and indexing.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Samuel Louvan 
> To: mahout-dev@lucene.apache.org
> Sent: Sunday, March 22, 2009 7:17:11 PM
> Subject: GSoC 2009-Discussion
> 
> Hi,
> I just browsed through the idea list for GSoC 2009 and I'm interested
> in working on Apache Mahout.
> Currently, I'm doing my master's project at my university on
> machine learning + information retrieval. More specifically,
> it is about how to discover the informative content of a web page
> using a machine learning approach.
> 
> Overall, there are two stages to this task: web page
> segmentation and locating the informative content.
> The web page segmentation process takes a DOM tree representation of an
> HTML document and groups the DOM nodes
> at a certain granularity. Next, a binary classification is performed on
> the DOM nodes to decide whether each node is
> informative or non-informative content. The features used
> for the classification include, for example, inner HTML length,
> inner text length, stop word ratio, offsetHeight, the coordinates of the
> HTML element in the browser, etc.
> 
> The dataset is generated by a labeling program that I made (for
> supervised learning). Basically, a user can
> select & annotate a particular segment of a web page and then mark
> the class label as informative or non-informative content.
> 
> I did some small experiments with this last semester: I played with
> WEKA and tried several algorithms, namely random forests,
> decision trees, SVMs, and neural networks. In these experiments, random
> forests and decision trees yielded the most satisfying results.
> 
> Currently, I'm working on my master's project and will implement a
> machine learning algorithm, either a decision tree or a random forest,
> for the classifier. For this reason, I'm very interested in working on
> Apache Mahout in this year's GSoC to implement one of those
> algorithms.
> 
> 
> My questions:
> - I just noticed in the mailing archive that another student is also quite
>   serious about implementing the random forest algorithm. Should I choose
>   decision trees instead? (for my future GSoC proposal)
> - Actually, I think it would be interesting to combine Apache Nutch and
>   Mahout: the idea is to implement web page segmentation + a classifier
>   inside a web crawler. By doing this, a crawler, for instance, could use
>   the output of the classification to follow only those links that lie in
>   informative content parts.
>   Does this sound interesting and make sense to you guys?
> 
> For more details, you can download my presentation slides and my
> master's project description at
> http://rapidshare.com/files/212352116/Slide_Doc.zip
> 
> A little bit of background about me: I'm a 2nd-year master's student at TU
> Eindhoven, Netherlands.
> Last year I also participated in GSoC with OpenNMS
> (http://code.google.com/soc/2008/opennms/appinfo.html?csaid=EDA725BD4D34D481)
> 
> 
> Looking forward to your feedback and input.
> 
> 
> 
> Regards,
> Samuel L.



Re: GSoC 2009-Discussion

2009-03-23 Thread Dawid Weiss


> [snip]

>   a web crawler. By doing this, a crawler, for instance, could use the
> output of the classification to follow only those links that lie in
> informative content parts.
>   Does this sound interesting and make sense to you guys?


Hi Samuel. This would be of great interest to the Nutch folks, I think. And 
obviously to Mahout as well, since it would be a practical application of an ML algorithm.


Dawid


GSoC 2009-Discussion

2009-03-22 Thread Samuel Louvan
Hi,
I just browsed through the idea list for GSoC 2009 and I'm interested
in working on Apache Mahout.
Currently, I'm doing my master's project at my university on
machine learning + information retrieval. More specifically,
it is about how to discover the informative content of a web page
using a machine learning approach.

Overall, there are two stages to this task: web page
segmentation and locating the informative content.
The web page segmentation process takes a DOM tree representation of an
HTML document and groups the DOM nodes
at a certain granularity. Next, a binary classification is performed on
the DOM nodes to decide whether each node is
informative or non-informative content. The features used
for the classification include, for example, inner HTML length,
inner text length, stop word ratio, offsetHeight, the coordinates of the
HTML element in the browser, etc.
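
Just to illustrate (this is only a rough sketch: the stop-word list is a
placeholder, and the rendering-based features such as offsetHeight, which need
a browser, are left out), a couple of the text-based features could be computed
from a DOM node like this:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.w3c.dom.Element;

public final class NodeFeatures {

  // tiny placeholder stop-word list, just for illustration
  private static final Set<String> STOP_WORDS =
      new HashSet<String>(Arrays.asList("the", "a", "an", "of", "and", "to", "in"));

  // "inner text length" feature: length of all text contained in the element
  public static int innerTextLength(Element e) {
    return e.getTextContent().length();
  }

  // "stop word ratio" feature: stop words / all whitespace-separated tokens
  public static double stopWordRatio(Element e) {
    String[] tokens = e.getTextContent().toLowerCase().trim().split("\\s+");
    if (tokens.length == 0 || tokens[0].isEmpty()) {
      return 0.0;
    }
    int stops = 0;
    for (String token : tokens) {
      if (STOP_WORDS.contains(token)) {
        stops++;
      }
    }
    return (double) stops / tokens.length;
  }
}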

The dataset is generated by a labeling program that I made (for
supervised learning). Basically, a user can
select & annotate a particular segment of a web page and then mark
the class label as informative or non-informative content.

I did some small experiments with this last semester: I played with
WEKA and tried several algorithms, namely random forests,
decision trees, SVMs, and neural networks. In these experiments, random
forests and decision trees yielded the most satisfying results.

Currently, I'm working on my master's project and will implement a
machine learning algorithm, either a decision tree or a random forest,
for the classifier. For this reason, I'm very interested in working on
Apache Mahout in this year's GSoC to implement one of those
algorithms.


My questions:
- I just noticed in the mailing archive that another student is also quite
  serious about implementing the random forest algorithm. Should I choose
  decision trees instead? (for my future GSoC proposal)
- Actually, I think it would be interesting to combine Apache Nutch and
  Mahout: the idea is to implement web page segmentation + a classifier
  inside a web crawler. By doing this, a crawler, for instance, could use
  the output of the classification to follow only those links that lie in
  informative content parts.
  Does this sound interesting and make sense to you guys?

For more details, you can download my presentation slides and my
master's project description at
http://rapidshare.com/files/212352116/Slide_Doc.zip

A little bit of background about me: I'm a 2nd-year master's student at TU
Eindhoven, Netherlands.
Last year I also participated in GSoC with OpenNMS
(http://code.google.com/soc/2008/opennms/appinfo.html?csaid=EDA725BD4D34D481)


Looking forward to your feedback and input.



Regards,
Samuel L.