2011/11/18 Alex Lopez <[email protected]>:
> I wanted to share my 2 cents about the classification using Stanbol as I had
> relatively good results applying Olivier's method (using MoreLikeThis to
> compare the input text with wikipedia abstracts) within my Stanbol instance
> running a dbpedia index:
>
> Using RemoteStreaming to classify remote plain text (in this example some
> RFC about mail) on a default Stanbol using full launcher:
>
> http://stanbolserver/solr/default/dbpedia_43k/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/rdfs:comment/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>
> or if a better index has been loaded (dbpedia) with indexed abstracts:
>
> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>
> Then process results: infer common broader categories, etc.

Nice to see that you experimented further with this idea. For the
broader category structure we have the information in the dbpedia skos
graph.
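
For anyone who wants to script the request quoted above, here is a
minimal sketch of how the MoreLikeThis URL could be assembled. The
function name and its arguments are made up for illustration; only the
query parameters are taken from the example URLs, and the base URL is
assumed to point at a Solr core exposed by Stanbol.

```python
from urllib.parse import urlencode


def build_mlt_url(solr_core_url, text_url, lang="en", field="dbp-ont:abstract"):
    """Build a MoreLikeThis URL like the ones quoted above (hypothetical helper)."""
    params = [
        # Remote streaming: Solr fetches the text to classify from this URL.
        ("stream.url", text_url),
        # Field to compare against, e.g. @en/dbp-ont:abstract/
        ("mlt.fl", f"@{lang}/{field}/"),
        ("mlt.interestingTerms", "list"),
        ("mlt.mintf", "0"),
        # Space-separated field list; urlencode turns the space into '+'.
        ("fl", "ref/rdf:type/ ref/dc:subject/"),
    ]
    return f"{solr_core_url}/mlt?{urlencode(params)}"


url = build_mlt_url(
    "http://stanbolserver/solr/default/dbpedia",
    "http://www.rfc-editor.org/rfc/rfc6409.txt",
)
```

The reserved characters in the field names get percent-encoded here,
which is the safer form when the URL is built by a program rather than
pasted in a browser.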

> Just to make some tests I extracted the most-repeated broader categories
> using all dc:subject with the above text and yielded:
>
> Internet
> Email
> Internet_protocols
> World_Wide_Web
> Application_layer_protocols
>
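
The counting step described above could be sketched roughly like this.
It assumes the Solr response has already been parsed into a list of
dicts (one per similar document) whose dc:subject values sit under the
key "ref/dc:subject/", and that an optional mapping from a category to
its broader categories has been built from the skos graph. Both the
key name and the mapping are assumptions, not something the index is
guaranteed to provide in that shape.

```python
from collections import Counter


def most_common_categories(docs, field="ref/dc:subject/", broader=None, top_n=5):
    """Return the top_n most repeated categories over all similar documents.

    docs    -- list of dicts parsed from the MoreLikeThis response (assumed shape)
    broader -- optional dict: category -> list of broader categories (assumed)
    """
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            # A single-valued field may come back as a plain string.
            values = [values]
        for category in values:
            # Count the broader categories when a mapping is given,
            # otherwise count the category itself.
            targets = broader.get(category, [category]) if broader else [category]
            counts.update(targets)
    return [name for name, _ in counts.most_common(top_n)]
```

Calling it without a `broader` mapping simply ranks the raw dc:subject
values, which is a quick way to check what the index returns before
walking up the category hierarchy.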
> Another example using a Portuguese text (bible fragment):
>
> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://scrapmaker.com/data/wordlists/genesis/portuguese.txt&mlt.fl=@pt/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/dc:subject/
>
> Categories_named_after_religious_texts

Categories of this kind are noisy technical boilerplate and should not
be indexed. My pignlproc scripts should take care of that.

> Christian_liturgy,_rites,_and_worship_services
> Christian_theology
>
> It works for me :)
>
> However, in the on-line instances I tested, the Solr server didn't seem to be
> exposed (as it is in the latest Stanbol revisions), so I can't give any
> ready-to-see working example.

Yes we need to work on that :) I think Rupert has already started.

> Thanks Olivier for the great idea!

I have plenty of ideas to improve the quality further by using Mahout
and its sparse prior logistic regression to trim down the index so
that it keeps only the most discriminative words (and bi-grams) for
each category. This should reduce the size of the index and improve
both the processing speed and the quality of the predictions.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
