I'm clustering news articles, a pretty typical use case, but I keep running into a problem that ruins the final cluster quality: noise, or "junk" sentences appended or prepended to the articles by the news outlet. Removing common noise from datasets is a problem shared by many domains (news, bioinformatics, etc.), so I figure a solution must already exist. Does anyone know of any libraries that clean common strings from a set of strings (Java, preferably)?
I'm scraping pages from news outlets using HTMLUnit and passing the output to Boilerpipe to extract the article contents. I've noticed that Boilerpipe doesn't always do that great of a job. Often noise will slip through, and when I cluster the data the results are skewed because of it. Examples of common "junk" sentences are as follows:

- “Get Connected! MASNsports.com is your online home for the latest Orioles and Nationals news, features, and commentary. And now, you can connect with MASN on every digital level. From web and social media to our new mobile alert service, MASN has got all the bases covered. Get social!”
- “Home KKTV firmly believes in freedom of speech for all and we are happy to provide this forum for the community to share opinions and facts. We ask that commenters keep it clean, keep it truthful, stay on topic and be responsible. Comments left here do not necessarily represent the viewpoint of KKTV 11 News. If you believe that any of the comments on our site are inappropriate or offensive, please tell us by clicking “Report Abuse” and answering the questions that follow. We will review any reported comments promptly.”
- “(TM and © Copyright 2014 CBS Radio Inc. and its relevant subsidiaries. CBS RADIO and EYE Logo TM and Copyright 2014 CBS Broadcasting Inc. Used under license. All Rights Reserved. This material may not be published, broadcast, rewritten, or redistributed. The Associated Press contributed to this report.)”
- “(© Copyright 2014 The Associated Press. All Rights Reserved. This material may not be published, broadcast, rewritten or redistributed.)”

...and so on.

I've played around with a number of different methods to clean the dataset prior to clustering: manually gathering and scrubbing common substrings, using various Longest Common Subsequence (LCS) implementations, computing the Levenshtein distance for all possible substrings, and so on. I've put a significant amount of time into them and haven't had the greatest results.
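To make the goal concrete, here is a minimal sketch of one simple baseline: drop any sentence that recurs across a large share of the articles. This is not the code behind the attempts described above and not an existing library. The class name `BoilerplateFilter`, the `maxDocFraction` threshold, and the naive regex sentence splitter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: removes sentences that repeat across many documents.
final class BoilerplateFilter {

    // Naive sentence splitter (assumption): break on whitespace after . ! or ?
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            String t = s.trim();
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Lowercase, mask digit runs, and collapse whitespace so that
    // "Copyright 2014" and "Copyright 2015" count as the same sentence.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ENGLISH)
                .replaceAll("\\d+", "#")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Drops every sentence found in more than maxDocFraction of the documents
    // (and in at least two of them); returns the cleaned documents in order.
    static List<String> clean(List<String> docs, double maxDocFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        List<List<String>> split = new ArrayList<>();
        for (String doc : docs) {
            List<String> ss = sentences(doc);
            split.add(ss);
            Set<String> seen = new HashSet<>();
            for (String s : ss) {
                String key = normalize(s);
                // Count each sentence at most once per document.
                if (seen.add(key)) docFreq.merge(key, 1, Integer::sum);
            }
        }
        List<String> cleaned = new ArrayList<>();
        for (List<String> ss : split) {
            StringBuilder sb = new StringBuilder();
            for (String s : ss) {
                int df = docFreq.get(normalize(s));
                if (df > 1 && df > maxDocFraction * docs.size()) continue;
                if (sb.length() > 0) sb.append(' ');
                sb.append(s);
            }
            cleaned.add(sb.toString());
        }
        return cleaned;
    }
}
```

Calling `BoilerplateFilter.clean(articles, 0.05)` would drop sentences that show up in more than 5% of the articles. Exact matching after normalization only catches verbatim repeats such as the copyright footers; near-duplicates would still need some fuzzy comparison, and grouping articles by outlet first would probably help with site-specific junk.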
So I figured I'd ask whether anyone knows of a library that does something along the lines of what I'm trying to do. Has anyone had any luck finding such a thing?

Many thanks,
-David
