Re: [Haskell-cafe] Finding longest common prefixes in a list
On Sat, Jan 21, 2012 at 8:18 AM, Twan van Laarhoven twa...@gmail.com wrote: Notice that there are lots of miku-X prefixes found. This is probably not what you want. What exactly do you want the algorithm to do? For example, is obviously a prefix of every string, but it is not very long. On the other hand, each string is a prefix of itself, but that prefix is shared by only one string (usually). By the way, the sort and compare adjacent pairs approach corresponds to atLeastThisManyDescendants 2. Ah, now the code makes sense to me. It's longer, but it is a heck of a lot more principled and readable, so I'm happy to replace my version with yours. It's not too hard to convert it into a CLI filter with optional depth (default of 2, replicating original behavior): import qualified Data.Map as Map import System.Environment (getArgs) import Data.List (sortBy) import Data.Ord (comparing) main :: IO () main = do arg - getArgs let n = if null arg then 2 else read (head arg) :: Int interact (unlines . chunk n . lines) chunk :: Int - [String] - [String] chunk n = map prefix . sortByLength . atLeastThisManyDescendants n . fromList where sortByLength :: [CommonPrefix Char] - [CommonPrefix Char] sortByLength = sortBy (comparing (numDescendant . names)) . And the results seem kosher (printing just the prefixes is probably the best idea, but wouldn't be too hard to switch to printing full filenames - just filter the original file list with the extracted prefix from each CommonPrefix): $ ls music/vocaloid/| runhaskell lcp.hs 5 miku-s miku-t miku-r rin- miku-a gumi- luka- $ ls music/vocaloid/| runhaskell lcp.hs 4 miku-h miku-m miku-n miku-p miku-s miku-t miku-r rin- miku-a gumi- luka- $ ls music/vocaloid/| runhaskell lcp.hs # with 2 chorus- gumi-mo gumi-s kaito- luka-emon luka-t miku-acolorlinkingworld- miku-akayaka miku-cleantears-remind2011natsu- miku-dan miku-ele miku-galaxyodyssey- miku-ha miku-inn miku-jemappelle-motion- miku-kz- miku-lo miku-m@rk- miku-plustellia-壁の彩度- miku-ro miku-se miku-ta miku-the miku-tinyparadise- miku-ジラートP-birthdayofeden- miku-杯本選 miku-般若心経 niconicochorus- yuki- len- luka-di miku-re:package- miku-n rin- -- gwern http://www.gwern.net import qualified Data.Map as Map import System.Environment (getArgs) import Data.List (sortBy) import Data.Ord (comparing) main :: IO () main = do arg - getArgs let n = if null arg then 2 else read (head arg) :: Int interact (unlines . chunk n . lines) chunk :: Int - [String] - [String] chunk n = map prefix . sortByLength . atLeastThisManyDescendants n . fromList where sortByLength :: [CommonPrefix Char] - [CommonPrefix Char] sortByLength = sortBy (comparing (numDescendant . names)) -- A trie datatype data Trie a = Trie { numLeafs, numDescendant :: !Int , children :: Map.Map a (Trie a) } instance (Show a) = Show (Trie a) where showsPrec _ t = showString fromList . shows (toList t) -- The empty trie empty :: Trie a empty = Trie 0 0 Map.empty -- A trie that contains a single string singleton :: Ord a = [a] - Trie a singleton [] = Trie 1 1 Map.empty singleton (x:xs) = Trie 0 1 (Map.singleton x (singleton xs)) -- Merge two tries merge :: Ord a = Trie a - Trie a - Trie a merge (Trie l d c) (Trie l' d' c') = Trie (l+l') (d+d') (Map.unionWith merge c c') fromList :: Ord a = [[a]] - Trie a fromList = foldr (merge . singleton) empty toList :: Trie a - [[a]] toList (Trie l _ c) = replicate l [] ++ [ x:xs | (x,t) - Map.toList c, xs - toList t ] data CommonPrefix a = Prefix { prefix :: [a], names :: Trie a } instance (Show a) = Show (CommonPrefix a) where showsPrec _ (Prefix p ns) = shows p . showString ++ . shows (toList ns) -- Find prefixes that have at least minD descendants. -- when there is a prefix xs with =minD descendants, then shorter prefixes will not be returned atLeastThisManyDescendants :: Int - Trie a - [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie _ d c) | d minD = [] -- too few descendants | null forChildren = [Prefix [] trie] -- all longer prefixes have too few descendants, but this prefix doesn't | otherwise = forChildren -- there are longer prefixes with enough descendants, return them where forChildren = [ Prefix (x:pfx) nms | (x,t) - Map.toList c , Prefix pfx nms - atLeastThisManyDescendants minD t ] {- *Main mapM_ (print . prefix) $ atLeastThisManyDescendants 4 test1 gumi- luka- miku-a miku-h miku-m miku-n miku-p miku-r miku-s miku-t rin- test1 :: Trie Char test1 = fromList [chorus-kiminoshiranaimonogatari.ogg ,chorus-mrmusic.ogg ,choucho-lastnightgoodnight.ogg ,dylanislame-aikotoba.ogg ,electriclove-エレクトリック・ラブ-korskremix.ogg ,gumi-bacon8-justhangingaround.ogg ,gumi-iapologizetoyou.ogg ,gumi-montblanc.ogg ,gumi-mozaikrole.ogg ,gumi-ハッピーシンセサイザ.ogg
Re: [Haskell-cafe] Finding longest common prefixes in a list
On 2012-01-20 23:44, Gwern Branwen wrote: On Fri, Jan 20, 2012 at 1:57 PM, Twan van Laarhoventwa...@gmail.com wrote: Here is some example code (untested): Well, you're right that it doesn't work. I tried to fix the crucial function, 'atLeastThisManyDescendants', but it's missing something because varying parts doesn't much affect the results when I try it out on example input - it either returns everything or nothing, it seems: atLeastThisManyDescendants :: Int - Trie a - [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie l d t') | d minD = [] | null forChildren = [Prefix [] trie] | otherwise = forChildren where forChildren = [ Prefix (x:pfx) nms | (x,t) - Map.toList t' , Prefix pfx nms - atLeastThisManyDescendants l t ] It should be atLeastThisManyDescendants minD t, minD is a threshold for the minimum numer of descendants, and it stays the same in the recursive call. That's what you get for not testing your code :) With the correct function I get a result like: *Main mapM_ (print . prefix) $ atLeastThisManyDescendants 4 test1 gumi- luka- miku-a miku-h miku-m miku-n miku-p miku-r miku-s miku-t rin- Notice that there are lots of miku-X prefixes found. This is probably not what you want. What exactly do you want the algorithm to do? For example, is obviously a prefix of every string, but it is not very long. On the other hand, each string is a prefix of itself, but that prefix is shared by only one string (usually). By the way, the sort and compare adjacent pairs approach corresponds to atLeastThisManyDescendants 2. Twan import qualified Data.Map as Map -- A trie datatype data Trie a = Trie { numLeafs, numDescendant :: !Int , children :: Map.Map a (Trie a) } instance (Show a) = Show (Trie a) where showsPrec _ t = showString fromList . shows (toList t) -- The empty trie empty :: Trie a empty = Trie 0 0 Map.empty -- A trie that contains a single string singleton :: Ord a = [a] - Trie a singleton [] = Trie 1 1 Map.empty singleton (x:xs) = Trie 0 1 (Map.singleton x (singleton xs)) -- Merge two tries merge :: Ord a = Trie a - Trie a - Trie a merge (Trie l d c) (Trie l' d' c') = Trie (l+l') (d+d') (Map.unionWith merge c c') fromList :: Ord a = [[a]] - Trie a fromList = foldr merge empty . map singleton toList :: Trie a - [[a]] toList (Trie l _ c) = replicate l [] ++ [ x:xs | (x,t) - Map.toList c, xs - toList t ] data CommonPrefix a = Prefix { prefix :: [a], names :: Trie a } instance (Show a) = Show (CommonPrefix a) where showsPrec _ (Prefix p ns) = shows p . showString ++ . shows (toList ns) -- Find prefixes that have at least minD descendants. -- when there is a prefix xs with =minD descendants, then shorter prefixes will not be returned atLeastThisManyDescendants :: Int - Trie a - [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie l d c) | d minD = [] -- too few descendants | null forChildren = [Prefix [] trie] -- all longer prefixes have too few descendants, but this prefix doesn't | otherwise = forChildren -- there are longer prefixes with enough descendants, return them where forChildren = [ Prefix (x:pfx) names | (x,t) - Map.toList c , Prefix pfx names - atLeastThisManyDescendants minD t ] test1 = fromList [chorus-kiminoshiranaimonogatari.ogg ,chorus-mrmusic.ogg ,choucho-lastnightgoodnight.ogg ,dylanislame-aikotoba.ogg ,electriclove-ã¨ã¬ã¯ããªãã¯ã»ã©ã-korskremix.ogg ,gumi-bacon8-justhangingaround.ogg ,gumi-iapologizetoyou.ogg ,gumi-montblanc.ogg ,gumi-mozaikrole.ogg ,gumi-ãããã¼ã·ã³ã»ãµã¤ã¶.ogg ,gumi-showasengirl.ogg ,gumi-sweetfloatflatsã¹ã¤ã¼ãããã¼ãã¢ãã¼ã.ogg ,gumi-timewarpedafterchoppingmystagbeetle.ogg ,gumi-ãªãªã¸ãã«æ²-ä»ããã·ã¡ã°ãª.ogg ,gumi-ãã¯ãªãªã¸ãã«è¦ªå.ogg ,kaito-byakkoyano.ogg ,kaito-flowertail.ogg ,kasaneteto-tam-ochamekinouéé³ããå¹ã£åãããã¡ããæ©è½.ogg ,len-crime-timetosaygoodbye.ogg ,len-fireâflower.ogg ,len-ponponpon.ogg ,lily-prototype.ogg ,luka-apolxcore-waitingforyou.ogg ,luka-dimããã¤.ogg ,luka-dion-myheartwillgoon.ogg ,luka-dirgefilozofio-dirgeasleepinjesus.ogg ,luka-ã¢ã´ã¢ãã-doubelariatããã«ã©ãªã¢ãã.ogg ,luka-emon-heartbeats.ogg ,luka-emonloid3-ããã¼ããã¼.ogg ,luka-everybreathyoutake.ogg ,luka-ãªãªã¸ãã«-garden.ogg ,luka-justbefriends.ogg ,lukameiko-gemini.ogg ,luka-milkyway.ogg ,luka-ãã¿ãã-ããã.ogg ,luka-tic-tick.ogg ,luka-torinouta.ogg ,luka-zeijakukei-shounenshoujo.ogg ,luka-åæã«ã¢ãã¡-nologic-ä½ã£ã¦ã¿ã.ogg ,luka-é§ç®äººé.ogg ,meiko-artemis-awake.ogg ,miku-9ronicleãã©ãã.ogg ,miku-acolorlinkingworld-ãã®ä¸çã®ä¸ã§.ogg ,miku-acolorlinkingworld-éãè±.ogg
[Haskell-cafe] Finding longest common prefixes in a list
Recently I wanted to sort through a large folder of varied files and figure out what is a 'natural' folder to split out, where natural means something like 4 files with the same prefix. (This might be author, genre, subject, whatever I felt was important when I was naming the file.) Now usually I name files with hyphens as the delimiters like the hypothetical '1998-wadler-monads.pdf', and it would be easy to write a stdin/stdout filter to break Strings on hyphens and sort by whatever is most common. But this is rather hardwired, can I solve the more general problem of finding the longest common prefixes, whatever they are? This turns out to be much more difficult than simply finding 'the' longest common prefix (which is usually ). I found an algorithm of sorts at http://stackoverflow.com/a/6634624 but it was easier described than implemented. Eventually I wrote what I *think* is a correct program, but it's definitely of the write-only sort. Perhaps people have better implementations somewhere? I saw a lot of discussion of tries, but I didn't go that route. The code, followed by an example: #!/usr/bin/env runhaskell import Data.List (intercalate, isPrefixOf, nub, sort) main :: IO () main = interact (unlines . intercalate [] . chunkFiles . lines ) -- basic algorithm from http://stackoverflow.com/a/6634624 chunkFiles :: Ord a = [[a]] - [[[a]]] chunkFiles f = map (\(_,b) - filter (isPrefixOf b) f) $ sort $ map (\x - (countPrefixes x f,x)) (e $ bar f) sharedPrefixes :: Ord a = [[a]] - [a] sharedPrefixes [] = [] sharedPrefixes s = foldr1 sp2 s where sp2 l1 l2 = map fst . takeWhile (uncurry (==)) $ zip l1 l2 traverse :: Ord a = [[a]] - [[a]] traverse [] = [] traverse x = sharedPrefixes (take 2 x) : traverse (drop 1 x) bar :: Ord a = [[a]] - [[a]] bar = nub . sort . traverse . sort countPrefixes :: (Ord a) = [a] - [[a]] - Int countPrefixes x xs = length $ filter (x `isPrefixOf`) xs e :: Eq a = [[a]] - [[a]] e y = map fst $ filter snd $ map (\x - (x, (==) 1 $ length . filter id $ map (x `isPrefixOf`) y)) y {- Example input from `ls`: chorus-kiminoshiranaimonogatari.ogg chorus-mrmusic.ogg choucho-lastnightgoodnight.ogg dylanislame-aikotoba.ogg electriclove-エレクトリック・ラブ-korskremix.ogg gumi-bacon8-justhangingaround.ogg gumi-iapologizetoyou.ogg gumi-montblanc.ogg gumi-mozaikrole.ogg gumi-ハッピーシンセサイザ.ogg gumi-showasengirl.ogg gumi-sweetfloatflatsスイートフロートアパート.ogg gumi-timewarpedafterchoppingmystagbeetle.ogg gumi-オリジナル曲-付きホシメグリ.ogg gumi-ミクオリジナル親友.ogg kaito-byakkoyano.ogg kaito-flowertail.ogg kasaneteto-tam-ochamekinou重音テト吹っ切れたおちゃめ機能.ogg len-crime-timetosaygoodbye.ogg len-fire◎flower.ogg len-ponponpon.ogg lily-prototype.ogg luka-apolxcore-waitingforyou.ogg luka-dimトロイ.ogg luka-dion-myheartwillgoon.ogg luka-dirgefilozofio-dirgeasleepinjesus.ogg luka-アゴアニキ-doubelariatダブルラリアット.ogg luka-emon-heartbeats.ogg luka-emonloid3-ハローハロー.ogg luka-everybreathyoutake.ogg luka-オリジナル-garden.ogg luka-justbefriends.ogg lukameiko-gemini.ogg luka-milkyway.ogg luka-やみくろ-かいぎ.ogg luka-tic-tick.ogg luka-torinouta.ogg luka-zeijakukei-shounenshoujo.ogg luka-勝手にアニメ-nologic-作ってみた.ogg luka-駄目人間.ogg meiko-artemis-awake.ogg miku-9ronicleプラチナ.ogg miku-acolorlinkingworld-この世界の下で.ogg miku-acolorlinkingworld-青い花.ogg miku-a+jugos-lullabyforkindness.ogg miku-akayaka-beacon.ogg miku-akayakap-sunrise.ogg miku-aoihana.ogg miku-arabianresponse.ogg miku-avtechno-tear.ogg miku-こえをきかせてcicci.ogg miku-cleantears-remind2011natsu-greenhillzonecrystiararemix.ogg miku-cleantears-remind2011natsu-夏影summerwindremix.ogg miku-clocklockworks.ogg miku-dancedancevol2-runner.ogg miku-daniwellp-chaoticuniverse.ogg miku-dixieflatline-shinonomescrumble.ogg miku-electricloveエレクトリックラヴ.ogg miku-elegumitokyo-kissmebaby.ogg miku-galaxyodyssey-cryingirl.ogg miku-galaxyodyssey-galaxyspacelines.ogg miku-hakamairi.ogg miku-haruna.ogg miku-heartshooter.ogg miku-hoshikuzutokakera.ogg miku-innes.ogg miku-innocence初音ミク.ogg miku-jemappelle-motion-likeyou.ogg miku-jemappelle-motion-ohwell.ogg miku-jevannip-myfavoritesummer.ogg miku-kakokyuudance-過呼吸ダンス.ogg miku-kz-packaged.ogg miku-kz-tellyourworld.ogg miku-lastscene.ogg miku-lostmemories付き-初音ミク.ogg miku-lovelyday.ogg miku-いいわけlove_song.ogg mikulukagumi-prayfor.ogg miku-maple-初音ミク楓-オリジナル曲.ogg miku-more1.5.ogg mik...@rk-eklosion.ogg mik...@rk-kirch.ogg miku-nana-ボーナストラック-ハッピー般若コア.ogg miku-nekomimiswitch.ogg miku-nightrainbow.ogg miku-noyounome.ogg miku-むかしむかしのきょうのぼくオリジナル.ogg miku-pandolistp-neverendinghammertime.ogg miku-ジラートP-birthdayofeden-deepsleep.ogg miku-ジラートP-birthdayofeden-水中読書.ogg miku-plustellia-dear.ogg miku-plustellia-壁の彩度-crazygirl.ogg miku-plustellia-壁の彩度-discoradio.ogg miku-ぽわぽわP-ストロボライト.ogg miku-rabbitforgets.ogg miku-re:package-lastnightgoodnight.ogg miku-re:package-ourmusic.ogg miku-re:package-sutorobonaitsu.ogg miku-rollinggirl.ogg miku-ryo-メルト-melt.ogg miku-senseiniitteyaro.ogg miku-sevencolors-レモネード.ogg miku-shoukinosatadenia.ogg miku-stratosphere.ogg miku-supernova.ogg
Re: [Haskell-cafe] Finding longest common prefixes in a list
On 20/01/12 18:45, Gwern Branwen wrote: Recently I wanted to sort through a large folder of varied files and figure out what is a 'natural' folder to split out, where natural means something like4 files with the same prefix. My idea for an algorithm would be: build a trie for the input strings, and then look for the deepest subtries with more than one child. For example, a trie containing the strings chorus-kiminoshiranaimonogatari.ogg chorus-mrmusic.ogg choucho-lastnightgoodnight.ogg looks like: root (3 items) c (3 items) h (3 items) o (3 items) r (2 items) u (2 items) s (2 items) - (2 items) k (1 item) i (1 item) minoshiranaimonogatari.ogg m (1 item) r (1 item) music.ogg u (1 item) c (1 item) ho-lastnightgoodnight.ogg Where actually the lines with more than one character are also subtrees of subtrees of subtrees. Here is some example code (untested): import qualified Data.Map as Map -- A trie datatype data Trie a = Trie { numLeafs, numDescendant :: !Int , children :: Map.Map a (Trie a) } -- The empty trie empty :: Trie a empty = Trie 0 0 Map.empty -- A trie that contains a single string singleton :: Ord a = [a] - Trie a singleton [] = Trie 1 1 Map.empty singleton (x:xs) = Trie 0 1 (Map.singleton x (singleton xs) -- Merge two tries merge :: Ord a = Trie a - Trie a - Trie a merge (Trie l d c) (Trie l' d' c') = Trie (l+l') (d+d') (Map.unionWith merge c c') fromList :: Ord a = [[a]] - Trie a fromList = foldr merge empty . map singleton toList :: Ord a = Trie a - [[a]] toList (Trie l _ c) = replicate l [] ++ [ x:xs | (x,t) - Map.toList c, xs - toList t ] data CommonPrefix a = Prefix { prefix :: [a], names :: Trie a } atLeastThisManyDescendants :: Int - Trie a - [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie l d t) | d minD = [] | null forChildren = [Prefix [] trie] | otherwise = forChildren where forChildren = [ Prefix (x:pfx) names | (x,t) - Map.toList c , Prefix pfx names - atLeastThisManyDescendants n t ] Twan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Finding longest common prefixes in a list
On Fri, Jan 20, 2012 at 1:57 PM, Twan van Laarhoven twa...@gmail.com wrote: Here is some example code (untested): Well, you're right that it doesn't work. I tried to fix the crucial function, 'atLeastThisManyDescendants', but it's missing something because varying parts doesn't much affect the results when I try it out on example input - it either returns everything or nothing, it seems: atLeastThisManyDescendants :: Int - Trie a - [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie l d t') | d minD = [] | null forChildren = [Prefix [] trie] | otherwise = forChildren where forChildren = [ Prefix (x:pfx) nms | (x,t) - Map.toList t' , Prefix pfx nms - atLeastThisManyDescendants l t ] -- gwern http://www.gwern.net ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe