[Haskell-cafe] SIGSEGV in yieldCapability ()
Hi. What would be the reason for

    Program received signal SIGSEGV, Segmentation fault.
    0x006071de in yieldCapability ()
    (gdb) where
    #0  0x006071de in yieldCapability ()
    #1  0x0060bc6b in schedule ()
    #2  0x00609294 in real_main ()
    #3  0x0060938d in hs_main ()
    #4  0x76b07eff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
    #5  0x0040a589 in _start ()

This happens in a program compiled with ghc-7.0.3 (and also with ghc-6.12.3), running with +RTS -N on an i7-920, Ubuntu 11 (kernel 2.6.38-8).

Could this simply mean "heap exhausted"?

Thanks - J.W.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Handling a large database (of ngrams)
Hi Wren,

First of all, thanks for your elaborate answer! Your input is very much appreciated!

On Sat, May 21, 2011 at 10:42:57PM -0400, wren ng thornton wrote:
> I've been working on some n-gram stuff lately, which I'm hoping to put up on Hackage sometime this summer (i.e., in a month or so, after polishing off a few rough edges).

Unfortunately, my current time constraints don't allow me to wait for your code :-( However, research on that system might continue well into the summer (if I can manage to get paid for it…) so I'll keep an eye out for your announcement!

> Unless you really need to get your hands dirty, the first thing you should do is check out SRILM and see if that can handle it. Assuming SRILM can do it, you'd just need to write some FFI code to call into it from Haskell. This is relatively straightforward, and if it hasn't been done already it'd be a great contribution to the Haskell NLP community. One great thing about this approach is that it's pretty easy to have the SRILM server running on a dedicated machine, so that the memory requirements of the model don't interfere with the memory requirements of your program. (I can give you pointers to Java code doing all this, if you'd like.) One big caveat to bear in mind is that SRILM is not threadsafe, so that'll get in the way of any parallelism or distributed computing on the client side of things. Also, SRILM is hell to install.

I don't need distributed computing, but I haven't used SRILM because I'd have to write a wrapper around it, and because I need to do some custom language model stuff for which I need to write functions that operate on n-gram probabilities directly, and I want to try out different machine learning methods to compare their merits for my research. I'll see if SRILM is flexible enough to do this, though.

> If you have too much trouble trying to get SRILM to work, there's also the Berkeley LM which is easier to install. I'm not familiar with its inner workings, but it should offer pretty much the same sorts of operations.

Do you know how BerkeleyLM compares to, say, MongoDB and PostgreSQL for large data sets? Maybe this is also the wrong list to ask this kind of question.

> But, with that in mind I can offer a few pointers for doing it in Haskell. The first one is interning. Assuming the data is already tokenized (or can be made to be), then it should be straightforward to write a streaming program that converts every unique token into an integer (storing the conversion table somewhere for later use).

This is what I meant by using the flyweight pattern. There's of course also the possibility of computing a hash of every string, but I don't want to deal with hash collisions. They are largely avoidable (using, say, SHA1), but that would force a multi-byte index, so I don't know if it would help much, seeing as the average word length isn't dramatically big.

> It doesn't even need to be complete morphological segmentation, just break on the obvious boundaries in order to get the average word size down to something reasonable.

I will do my best to avoid having to do this. My current research target is German, which is richer in morphology than English, but not as much as Turkic, Slavic or Uralic languages. In the case of a strongly inflectional language, or even an agglutinating or polysynthetic one, the need for a smarter morphological analysis would arise. Unfortunately, this would have to be in the form of language-specific morphological analysers. Since this is something I would want to do in general anyway, I might write a morphological analyser for German that breaks words down to case and tense markings as well as lemmas, but this isn't the focus of my project, so I'll first try to do without.

> For regular projects, that integerization would be enough, but for your task you'll probably want to spend some time tweaking the codes. In particular, you'll probably have enough word types to overflow the space of Int32/Word32 or even Int64/Word64.

I will, most likely, because of the long tail of word frequencies, run into problems with integer space; not only that, but these ultra-rare words aren't really going to be of much use to me. I'm debating using a trie for common words while keeping a backlog of rare words in a MongoDB instance. Rare words could graduate from there if they occur frequently enough, and get accepted into the trie for easier access. Everything left over in the long tail would just be mapped to RARETAG in the n-grams, where TAG refers to the part-of-speech tag (since my corpus is POS-tagged). What number constitutes rare and common will have to be subject to experimentation.

A Bloom filter might guard the trie, too. Since BFs can tell you if a certain element is certainly *not* in a collection, I could cut down search operations on the trie itself for all the rare words. AFAICR there's a chapter in RWH on building Bloom filters. There are
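
The interning step discussed in this message (convert every unique token to an integer and keep the conversion table) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the names InternTable, intern and internAll are made up here, tokens are plain Strings, and codes are Word32. A real corpus would likely want ByteString or Text keys and a more compact table.

```haskell
import qualified Data.Map.Strict as M
import Data.List (mapAccumL)
import Data.Word (Word32)

-- | Conversion table from token to integer code, plus the next unused code.
--   (Hypothetical type, not taken from any library mentioned in the thread.)
data InternTable = InternTable
  { tableCodes :: !(M.Map String Word32)
  , tableNext  :: !Word32
  }

emptyTable :: InternTable
emptyTable = InternTable M.empty 0

-- | Look up a token's code, assigning a fresh code the first time it is seen.
intern :: InternTable -> String -> (InternTable, Word32)
intern tbl@(InternTable codes next) tok =
  case M.lookup tok codes of
    Just code -> (tbl, code)
    Nothing   -> (InternTable (M.insert tok next codes) (next + 1), next)

-- | Intern a whole token stream, threading the table through the list.
--   Returns the final table (to be stored for later use) and the codes.
internAll :: [String] -> (InternTable, [Word32])
internAll = mapAccumL intern emptyTable
```

For example, internAll (words "the cat saw the dog") yields the codes [0,1,2,0,3], so the repeated "the" shares one code. The n-grams can then be built over the integer codes instead of the strings.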
Re: [Haskell-cafe] SIGSEGV in yieldCapability ()
On Sun, May 22, 2011 at 9:31 AM, Johannes Waldmann waldm...@imn.htwk-leipzig.de wrote:
> Hi. What would be the reason for
> Program received signal SIGSEGV, Segmentation fault.

Looks like a bad bug; we shouldn't expect segmentation faults from the runtime system. I think you should file a bug report with a test case against GHC.

Cheers =),

-- Felipe.
[Haskell-cafe] ANNOUNCE: wl-pprint-text 1.0.0.0
And lo, the people of Haskell hath declared that the text [1] library was to be the One True Way of dealing with textual data (unless, of course, you're annoyed about it using UTF-16 instead of UTF-8).

And yet there was much sadness in the land, for there was no pretty-printing library available to produce said Text values, meaning that it was difficult for supplicants to produce the formatting their form of praise required.

A voice then spake: make ye a new pretty-printing library aimed specifically at Text values. But thou shall not use the API from pretty [2], as it is rather limited. Instead, the design of wl-pprint [3] looks Real Good! It has more combinators available, can neatly deal with wrapping lists, and also has support for producing output for both humans and the machines that serve us so well!

Thus were mighty energies aimed at producing such a library. Many were the hours at which toil was spent on this task, but at long last the task was completed. Yet great evil (i.e. programmer laziness) befell the great task, and it lay hidden for many a moon. However, with much effort the project was recovered, and is now available for all! [4]

Some changes were regrettably required to achieve this task. The most important of which was that it was discovered that wl-pprint may not be as perfect as all that, in that documents of no content were treated poorly. However, aspects of the pretty package were able to redeem themselves by supplying the necessary functionality. Moreover, the great task was able to extend upon that which was there before, by invoking the almighty powers of the Monad.

TL;DR: wl-pprint-text is a pretty-printing package for lazy Text values based upon the API of wl-pprint. It does, however, deal with the nil document better than wl-pprint does, and also has a version where all the combinators are lifted into a Monad, primarily so that it can be used within a State Monad.

[1]: http://hackage.haskell.org/package/text
[2]: http://hackage.haskell.org/package/pretty
[3]: http://hackage.haskell.org/package/wl-pprint
[4]: http://hackage.haskell.org/package/wl-pprint-text

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
Re: [Haskell-cafe] SIGSEGV in yieldCapability ()
> I think you should file a bug report with a test case on GHC.

I am willing to work on this, but I thought I'd go fishing for some advice first. My program uses forkIO, STM, and FFI.

I think that "heap exhausted" sometimes gets reported as "evacuate: strange closure" (cf. http://hackage.haskell.org/trac/ghc/ticket/5085 ), and the crash in yieldCapability() might be another instance.

Thanks - J.W.
Re: [Haskell-cafe] Is there an efficient way to generate Euler's totient function for [2, 3..n]?
Daniel Fischer wrote:
> On Saturday 14 May 2011 19:38:03, KC wrote:
>> Instead of finding the totient of one number, is there a quicker way
>> when processing a sequence?
>
> For some sequences.

You may find alternative ways of computation in the Online Encyclopedia of Integer Sequences.
http://oeis.org/A10

>> -- From: http://www.haskell.org/haskellwiki/99_questions/Solutions/34
>> totient :: Int -> Int
>> totient n = length [x | x <- [1..n], coprime x n]
>>   where coprime a b = gcd a b == 1
>
> NEVER do that! It's awfully slow.

It's declarative and may help to verify more efficient implementations.
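
For the question in the subject line (all totients for [2, 3..n] at once), one common approach is a sieve in the style of Eratosthenes. The sketch below is an illustration, not code posted in this thread: totientsUpTo is a made-up name, and it assumes n >= 0 and that the array package shipped with GHC is available. The declarative definition from the wiki is kept as totientNaive, in the spirit of the last remark above, to check the fast version.

```haskell
import Control.Monad (forM_, when)
import Data.Array.ST (newListArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, (!))

-- | The declarative definition quoted in the thread: slow, but easy to trust.
totientNaive :: Int -> Int
totientNaive n = length [x | x <- [1 .. n], gcd x n == 1]

-- | Sieve for phi(0..n). Start with phi(k) = k. Whenever phi(p) is still p
--   when we reach p, no smaller prime divides p, so p is prime; then every
--   multiple m of p is scaled by (1 - 1/p), i.e. phi(m) -= phi(m) `div` p.
totientsUpTo :: Int -> UArray Int Int
totientsUpTo n = runSTUArray $ do
  phi <- newListArray (0, n) [0 .. n]
  forM_ [2 .. n] $ \p -> do
    v <- readArray phi p
    when (v == p) $
      forM_ [p, 2 * p .. n] $ \m -> do
        x <- readArray phi m
        writeArray phi m (x - x `div` p)
  return phi
```

The sieve touches each number once per distinct prime factor, whereas the naive version performs k gcd computations for every k. A quick sanity check is to compare totientsUpTo 1000 ! k against totientNaive k for all k in [1..1000].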
Re: [Haskell-cafe] Is there an efficient way to generate Euler's totient function for [2, 3..n]?
> It's declarative and may help to verify more efficient implementations.

WOW! Good insight. :)

-- 
Regards, KC
[Haskell-cafe] code review?
Hey guys. Pretty new to Haskell here. I've started off by writing a small command-line interface to Plurk. I was just wondering if anyone would be willing to give everything a look-over and lemme know what kinds of things I should be looking to improve upon, style-wise. Not sure I'm currently doing things in the 'Haskell way', so to speak. Thanks a bunch!

https://github.com/saiko-chriskun/hermes

(note: the JSON and Plurkell submodules are also mine.)
Re: [Haskell-cafe] code review?
I'd keep the line length down to 80 chars, and not use ';'s. All of that fiddling with buffering, printing, and reading results could be put more clearly into a couple of functions.

'if all == False then return False else return True' is a pretty confusing way to say 'return all'. In fact, any time you see 'x == True' you can just remove the '== True'. The whole postAll thing would be clearer as

    postAll <- if all && not first then return all else ask "Post to all?"
    post <- if postAll then return True else ask "Post to..?"

Anywhere you have 'head x' or 'x!!0' or a 'case length xs of', that's a chance to crash. Don't do that. You can get rid of the heads by writing (config:configs) and [] cases for postStatus. Get rid of the !!0 by making config into a data type; it looks like 'data Config = Config { configPostTo :: URL?, configUser :: Maybe String, configPass :: Maybe String }'. Then 'pass <- maybe (ask "pass?") return (configPass config)'. Of course, why make these things optional at all?

Looks like the postStatus return value is not used. It would simplify it to not return those codes. I don't know anything about 'postPlurk' but it looks like it could return a real data type as well.

All this nested if/else stuff makes it hard to read, but I think you can replace the explicit recursion in postStatus with 'mapM_ (postStatus update) configs'. It looks like that mysterious (all, first) pair has a different value for the first one; in that case it would be clearer to either not do that, or do it explicitly like

    case configs of
        first : rest -> do
            postFirst update first
            mapM_ (postStatus update) rest
        [] -> complain about no configs

If you pass a single Bool to a function, it means it can have two behaviours, which is confusing. If you pass two Bools, then it can have 4, which is even more confusing :) I myself use if/else only rarely.

Looking a little at Plurkell, 'return =<<' is redundant. And I'm sure there's a urlEncode function around so you don't have to build URLs yourself?
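
Put together, the Config record and the pattern-matching advice above might look something like the sketch below. It is not the hermes code: ask, orAsk, postStatus and postToAll are stand-ins written for illustration, the URL is kept as a plain String, and posting is simulated with putStrLn rather than any real Plurk call.

```haskell
import System.IO (hFlush, stdout)

-- | Hypothetical configuration record; field names follow the review.
data Config = Config
  { configPostTo :: String
  , configUser   :: Maybe String
  , configPass   :: Maybe String
  }

-- | Print a prompt and read one line of input.
ask :: String -> IO String
ask prompt = do
  putStr (prompt ++ " ")
  hFlush stdout
  getLine

-- | Use the configured value if there is one, otherwise ask for it.
orAsk :: String -> Maybe String -> IO String
orAsk prompt = maybe (ask prompt) return

-- | Stand-in for the real posting action (an assumption, not Plurk's API).
postStatus :: String -> Config -> IO ()
postStatus update config = do
  user  <- orAsk "user?" (configUser config)
  _pass <- orAsk "pass?" (configPass config)
  putStrLn ("posting " ++ show update ++ " to " ++ configPostTo config
            ++ " as " ++ user)

-- | Match on the list shape instead of using head, (!!) or length,
--   so the empty case is handled rather than crashing.
postToAll :: String -> [Config] -> IO ()
postToAll _      []      = putStrLn "no configs found"
postToAll update configs = mapM_ (postStatus update) configs
```

Because the record fields are Maybe values, the "ask only when missing" logic lives in one small function instead of being repeated in nested if/else branches.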

I don't understand the stuff with the words in the case, but it looks like a confusing way to say 'if word `elem` specialWords'. There's also a Set type for that kind of thing.

And that regex stuff is... quoting? Isn't there a library function for that too? It's the sort of thing a URL library would have. If not, it's something like 'replace [("|", "%7C"), ("/", "%2F"), (" ", "%20")]', right? I'm sure there's a replace function like that floating around somewhere; if not, you can write your own.

And for JSON... wasn't someone just complaining about there being too many JSON libraries on Hackage? Unless you want to do it yourself for fun (it's a good way to learn parsing), why not just download one of those?

That's enough for now, have fun :)
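
If you do end up writing the replace function yourself, a table-driven version is only a few lines. The sketch below is one possible shape, with assumptions: replaceChars, quoteTable and quote are invented names, the table is keyed on single characters rather than strings, and it covers only the three escapes mentioned above, so it is not a complete URL encoder. The specialWords set is likewise a made-up example of the Set suggestion.

```haskell
import Data.Maybe (fromMaybe)
import qualified Data.Set as Set

-- | Replace each character that has an entry in the table by its escape;
--   all other characters pass through unchanged.
replaceChars :: [(Char, String)] -> String -> String
replaceChars table = concatMap (\c -> fromMaybe [c] (lookup c table))

-- | Only the three escapes from the review; a real encoder needs many more.
quoteTable :: [(Char, String)]
quoteTable = [('|', "%7C"), ('/', "%2F"), (' ', "%20")]

quote :: String -> String
quote = replaceChars quoteTable

-- | Membership test with a Set instead of a chain of case alternatives.
--   The word list is invented for illustration.
specialWords :: Set.Set String
specialWords = Set.fromList ["says", "likes", "shares"]

isSpecial :: String -> Bool
isSpecial word = word `Set.member` specialWords
```

Here quote "a b/c|d" gives "a%20b%2Fc%7Cd". For real use, a library encoder (for instance urlEncode from the HTTP package's Network.HTTP.Base) is the safer choice, since it handles the full set of reserved characters.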
Re: [Haskell-cafe] code review?
Good points. Here are a few more.

Use more explicit types. You can cut down on if-then-else's by using pattern matching and guards. For example, the first if-then-else in postStatus can be turned into:

    import System.IO

    -- Wrap the two Bools so they cannot be mixed up at the call site.
    newtype PostToAllNetworks = PostToAll Bool
    newtype FirstPost = FirstPost Bool

    postToAllNetworks :: PostToAllNetworks -> FirstPost -> IO Bool
    postToAllNetworks (PostToAll True) (FirstPost True) = do
        putStr "Post to all networks? [y/n] "
        hSetBuffering stdin NoBuffering
        hFlush stdout
        postAll <- getChar
        hSetBuffering stdin LineBuffering
        return (postAll == 'y')
    postToAllNetworks _ _ = return False

Notice I'm not just playing golf here. This is easier to read.

Also, keep in mind that post is a noun and a verb. It is very functional to write functions whose names are nouns. It might be nice if Haskell had some syntax like Ruby does, so we could ask postToAllNetworks? But since we don't, you should be open to the possibility that a name can be either.