Hi David,

Interesting problem.
Just typing out loud here... Depending on how sloppy you want the match to be, and the maximum number of words you'd want to consider as a prefix (or suffix) to an article, one scalable approach (considering just the prefix case) is:

1. Pick a max number of words for the prefix; call this N.
2. Pick a max slop value; call this S.
3. Tokenize the first N words of each article.
4. For each word, output word/position pairs for every position within +/- S, e.g. <word>, <position - S> ... <word>, <position> ... <word>, <position + S>.
5. Calculate counts for each <word>, <position> pair.
6. Group on <word>, sort by <position>, and merge counts for positions that are close enough (taking the average position).
7. Calculate the document frequency for each remaining <word>, <position> pair.
8. Filter out any results with a DF less than some threshold (e.g. 0.20).
9. Save the resulting <word>, <position> values.

I'd do all of the above in a map-reduce job (assuming you've got a significant amount of data), and I'd add an implicit grouping by domain, unless you need this to span domains. There's a rough sketch of steps 3-9 right after this.

Then when you get an article, tokenize it and walk it word by word. If you find a matching word with a position that's close enough, mask it; otherwise skip the word. When you get more than some count of skipped words in a row, or more than some total count, you're done. There's a sketch of that pass below as well.
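To make the steps concrete, here's a minimal single-machine sketch, with plain Java collections standing in for the map-reduce plumbing. The class name PrefixNoiseModel and the N/S/threshold values are just placeholders, and the DF normalization is rough (each occurrence gets emitted 2S+1 times):

import java.util.*;

public class PrefixNoiseModel {

    private static final int N = 20;           // max # of prefix words to consider
    private static final int S = 2;            // max slop
    private static final double MIN_DF = 0.20; // DF threshold

    // Returns word -> list of [mergedPosition, count] entries.
    public static Map<String, List<int[]>> build(List<String> articles) {
        // Steps 4-5: emit each prefix word at position +/- S and count the pairs.
        Map<String, Map<Integer, Integer>> counts = new HashMap<>();
        for (String article : articles) {
            String[] words = article.trim().toLowerCase().split("\\s+");
            for (int pos = 0; pos < Math.min(N, words.length); pos++) {
                for (int p = Math.max(0, pos - S); p <= pos + S; p++) {
                    counts.computeIfAbsent(words[pos], w -> new HashMap<>())
                          .merge(p, 1, Integer::sum);
                }
            }
        }

        // Steps 6-9: per word, sort positions, merge close ones, filter by DF.
        Map<String, List<int[]>> model = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : counts.entrySet()) {
            List<Integer> positions = new ArrayList<>(e.getValue().keySet());
            Collections.sort(positions);
            List<int[]> merged = new ArrayList<>(); // each entry is [position, count]
            for (int p : positions) {
                int c = e.getValue().get(p);
                int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
                if (last != null && p - last[0] <= S) {
                    last[0] = (last[0] + p) / 2; // take the average position
                    last[1] += c;
                } else {
                    merged.add(new int[] { p, c });
                }
            }
            for (int[] m : merged) {
                // Each occurrence was emitted 2S+1 times, so normalize roughly
                // when estimating document frequency (step 7).
                double df = m[1] / ((2.0 * S + 1.0) * articles.size());
                if (df >= MIN_DF) {
                    model.computeIfAbsent(e.getKey(), w -> new ArrayList<>()).add(m);
                }
            }
        }
        return model;
    }
}

In the real map-reduce version the mapper would emit the <word>, <position> pairs and the reducer would do the merge/DF part, with the grouping key including the domain.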
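And the masking walk over a new article might look roughly like this, assuming the model built above. The skip thresholds are made-up numbers you'd want to tune:

import java.util.*;

public class PrefixMasker {

    private static final int S = 2; // same slop as the model
    private static final int MAX_SKIPS_IN_A_ROW = 3;
    private static final int MAX_TOTAL_SKIPS = 6;

    // Walks the article word by word, masking words that match the model
    // at a close-enough position, and stops once too many words in a row
    // (or in total) fail to match. Everything up to and including the
    // last masked word is dropped.
    public static String stripPrefix(String article, Map<String, List<int[]>> model) {
        String[] words = article.trim().split("\\s+");
        int skipsInARow = 0;
        int totalSkips = 0;
        int lastMasked = -1;
        for (int pos = 0; pos < words.length; pos++) {
            if (matches(words[pos].toLowerCase(), pos, model)) {
                lastMasked = pos;
                skipsInARow = 0;
            } else {
                skipsInARow++;
                totalSkips++;
                if (skipsInARow > MAX_SKIPS_IN_A_ROW || totalSkips > MAX_TOTAL_SKIPS) {
                    break; // we've walked past the junk prefix
                }
            }
        }
        return String.join(" ", Arrays.copyOfRange(words, lastMasked + 1, words.length));
    }

    private static boolean matches(String word, int pos, Map<String, List<int[]>> model) {
        List<int[]> entries = model.get(word);
        if (entries == null) {
            return false;
        }
        for (int[] e : entries) {
            if (Math.abs(e[0] - pos) <= S) {
                return true;
            }
        }
        return false;
    }
}

Suffix junk could presumably be handled the same way, run over the reversed word sequence.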
-- Ken

On Jun 3, 2014, at 12:16am, David Noel <[email protected]> wrote:

> I'm clustering a pretty typical use case (news articles), but I keep
> running into a problem that ends up ruining the final cluster quality:
> noise, or "junk" sentences appended or prepended to the articles by
> the news outlet. Removing common noise from datasets is a problem
> common to many domains (news, bioinformatics, etc.), so I figure there
> must be some solution to it in existence already. Does anyone know of
> any libraries to clean common strings from a set of strings (Java,
> preferably)?
>
> I'm scraping pages from news outlets using HTMLUnit and passing the
> output to Boilerpipe to extract the article contents. I've noticed
> that Boilerpipe doesn't always do that great a job. Often noise will
> slip through, and when I cluster the data the results are skewed
> because of it.
>
> Examples of common "junk" sentences are as follows:
>
> - "Get Connected! MASNsports.com is your online home for the latest
> Orioles and Nationals news, features, and commentary. And now, you can
> connect with MASN on every digital level. From web and social media to
> our new mobile alert service, MASN has got all the bases covered. Get
> social!"
>
> - "Home KKTV firmly believes in freedom of speech for all and we are
> happy to provide this forum for the community to share opinions and
> facts. We ask that commenters keep it clean, keep it truthful, stay on
> topic and be responsible. Comments left here do not necessarily
> represent the viewpoint of KKTV 11 News. If you believe that any of
> the comments on our site are inappropriate or offensive, please tell
> us by clicking "Report Abuse" and answering the questions that follow.
> We will review any reported comments promptly."
>
> - "(TM and © Copyright 2014 CBS Radio Inc. and its relevant
> subsidiaries. CBS RADIO and EYE Logo TM and Copyright 2014 CBS
> Broadcasting Inc. Used under license. All Rights Reserved. This
> material may not be published, broadcast, rewritten, or redistributed.
> The Associated Press contributed to this report.)"
>
> - "(© Copyright 2014 The Associated Press. All Rights Reserved. This
> material may not be published, broadcast, rewritten or redistributed.)"
>
> ...and so on.
>
> I've played around with a number of different methods to clean the
> dataset prior to clustering: manually gathering and scrubbing common
> substrings, using various LCS (Longest Common Subsequence)
> implementations, computing the Levenshtein distance for all possible
> substrings, and so on, but I've put a significant amount of time into
> them and haven't had the greatest results. So I figured I'd ask
> whether anyone knows of a library that does something along the lines
> of what I'm trying to do. Has anyone had any luck finding such a
> thing?
>
> Many thanks,
>
> -David

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
