Noninformative terms aren't quite what is needed here.  Instead, you can
look for long repeated phrases.  Any 8-word phrase that appears more
than 10 times is very likely a noise phrase.  The exact settings
should be tuned to your needs.
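A minimal sketch of that heuristic in Java (the class and method names here are my own, not from any existing library): slide an 8-word window over each document to produce shingles, count how many documents each shingle occurs in, and flag shingles that cross the threshold as noise.

```java
import java.util.*;

public class NoisePhraseDetector {
    // Tunable: 8 words and 10 occurrences follow the rule of thumb above.
    static final int SHINGLE_WORDS = 8;
    static final int THRESHOLD = 10;

    // All overlapping 8-word windows of a document, lowercased.
    static List<String> shingles(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + SHINGLE_WORDS <= words.length; i++) {
            out.add(String.join(" ",
                    Arrays.copyOfRange(words, i, i + SHINGLE_WORDS)));
        }
        return out;
    }

    // Shingles seen in at least THRESHOLD distinct documents.
    static Set<String> noisePhrases(List<String> docs) {
        Map<String, Integer> docCounts = new HashMap<>();
        for (String doc : docs) {
            // Count each shingle once per document, so a single long
            // article cannot push a phrase over the threshold alone.
            for (String s : new HashSet<>(shingles(doc))) {
                docCounts.merge(s, 1, Integer::sum);
            }
        }
        Set<String> noise = new HashSet<>();
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (e.getValue() >= THRESHOLD) {
                noise.add(e.getKey());
            }
        }
        return noise;
    }
}
```

Once you have the noise set, stripping any sentence whose shingles fall mostly inside it is straightforward. For a large corpus you would hash the shingles rather than keep the strings, but the counting logic is the same.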



On Tue, Jun 3, 2014 at 2:08 AM, Vyacheslav Murashkin <[email protected]>
wrote:

> Hi David!
>
> Probably you can filter noninformative terms with a tf-idf conversion.
>
> Slava
>
> 2014-06-03 11:16 GMT+04:00 David Noel <[email protected]>:
> > I'm clustering a pretty typical use case (news articles), but I keep
> > running into a problem that ends up ruining the final cluster quality:
> > noise, or "junk" sentences appended or prepended to the articles by
> > the news outlet. Removing common noise from datasets is a problem
> > common to many domains (news, bioinformatics, etc.) so I figure there
> > must be some solution to it in existence already. Does anyone know of
> > any libraries to clean common strings from a set of strings (Java,
> > preferably)?
> >
> > I'm scraping pages from news outlets using HTMLUnit and passing the
> > output to Boilerpipe to extract the article contents. I've noticed
> > that Boilerpipe doesn't always do that great of a job. Often noise
> > will slip through and when I cluster the data the results are skewed
> > because of it.
> >
> > Examples of common "junk" sentences are as follows:
> >
> > -”Get Connected! MASNsports.com is your online home for the latest
> > Orioles and Nationals news, features, and commentary. And now, you can
> > connect with MASN on every digital level. From web and social media to
> > our new mobile alert service, MASN has got all the bases covered. Get
> > social!”
> >
> > -”Home KKTV firmly believes in freedom of speech for all and we are
> > happy to provide this forum for the community to share opinions and
> > facts. We ask that commenters keep it clean, keep it truthful, stay on
> > topic and be responsible. Comments left here do not necessarily
> > represent the viewpoint of KKTV 11 News. If you believe that any of
> > the comments on our site are inappropriate or offensive, please tell
> > us by clicking “Report Abuse” and answering the questions that follow.
> > We will review any reported comments promptly.”
> >
> > -”(TM and © Copyright 2014 CBS Radio Inc. and its relevant
> > subsidiaries. CBS RADIO and EYE Logo TM and Copyright 2014 CBS
> > Broadcasting Inc. Used under license. All Rights Reserved. This
> > material may not be published, broadcast, rewritten, or redistributed.
> > The Associated Press contributed to this report.)”
> >
> > -”(© Copyright 2014 The Associated Press. All Rights Reserved. This
> > material may not be published, broadcast, rewritten or
> > redistributed.)”
> >
> > ...and so on.
> >
> > I've played around with a number of different methods to clean the
> > dataset prior to clustering: manually gathering and scrubbing common
> > substrings, using various LCS implementations (Longest Common
> > Subsequence), computing the Levenshtein distance for all possible
> > substrings, and on, but I've put a significant amount of time into
> > them and haven't had the greatest results. So I figure I'd ask if
> > anyone knows of any library that does something along the lines of
> > what I'm trying to do. Has anyone had any luck finding such a thing?
> >
> > Many thanks,
> >
> > -David
>
