[Haskell-cafe] SIGSEGV in yieldCapability ()

2011-05-22 Thread Johannes Waldmann
Hi. What would be the reason for

Program received signal SIGSEGV, Segmentation fault.
0x006071de in yieldCapability ()
(gdb) where
#0  0x006071de in yieldCapability ()
#1  0x0060bc6b in schedule ()
#2  0x00609294 in real_main ()
#3  0x0060938d in hs_main ()
#4  0x76b07eff in __libc_start_main ()
   from /lib/x86_64-linux-gnu/libc.so.6
#5  0x0040a589 in _start ()

this happens in a program compiled with ghc-7.0.3
(and also, with ghc-6.12.3), running with +RTS -N
on i7-920, ubuntu 11 (kernel 2.6.38-8)

Could this simply mean "heap exhausted"?

Thanks - J.W.



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Handling a large database (of ngrams)

2011-05-22 Thread Aleksandar Dimitrov
Hi Wren,

First of all, thanks for your elaborate answer! Your input is very much
appreciated!

On Sat, May 21, 2011 at 10:42:57PM -0400, wren ng thornton wrote:
> I've been working on some n-gram stuff lately, which I'm hoping to
> put up on Hackage sometime this summer (i.e., in a month or so,
> after polishing off a few rough edges).

Unfortunately, my current time constraints don't allow me to wait for your code
:-( However, research on that system might continue well into the summer (if I
can manage to get paid for it…) so I'll keep an eye out for your announcement!

> Unless you really need to get your hands dirty, the first thing you
> should do is check out SRILM and see if that can handle it. Assuming
> SRILM can do it, you'd just need to write some FFI code to call into
> it from Haskell. This is relatively straightforward, and if it
> hasn't been done already it'd be a great contribution to the Haskell
> NLP community. One great thing about this approach is that it's
> pretty easy to have the SRILM server running on a dedicated machine,
> so that the memory requirements of the model don't interfere with
> the memory requirements of your program. (I can give you pointers to
> Java code doing all this, if you'd like.) One big caveat to bear in
> mind is that SRILM is not threadsafe, so that'll get in the way of
> any parallelism or distributed computing on the client side of
> things. Also, SRILM is hell to install.

I don't need distributed computing, but I haven't used SRILM because I'd have
to write a wrapper around it, and because I need to do some custom language
model work for which I need functions that operate on n-gram probabilities
directly. I also want to try out different machine learning methods to compare
their merits for my research. I'll see whether SRILM is flexible enough to do
this, though.

> If you have too much trouble trying to get SRILM to work, there's
> also the Berkeley LM which is easier to install. I'm not familiar
> with its inner workings, but it should offer pretty much the same
> sorts of operations.

Do you know how BerkeleyLM compares to, say, MongoDB and PostgreSQL for large
data sets? Maybe this is the wrong list for this kind of question.

> But, with that in mind I can offer a few pointers for doing it in
> Haskell. The first one is interning. Assuming the data is already
> tokenized (or can be made to be), then it should be straightforward
> to write a streaming program that converts every unique token into
> an integer (storing the conversion table somewhere for later use).

This is what I meant by using the flyweight pattern. There's of course also the
possibility of computing a hash of every string, but I don't want to deal with
hash collisions. They are largely avoidable (using, say, SHA1), but since that
would force a multi-byte index, I don't know if it would help much, seeing as
the average word length isn't dramatically big.
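
As a rough illustration of the interning idea discussed here, the following is a minimal sketch, assuming an in-memory Data.Map as the conversion table. The names Intern, intern and internAll are hypothetical and not from any library mentioned in this thread; a real corpus would need a streaming, on-disk variant.

```haskell
import qualified Data.Map.Strict as M
import Data.List (mapAccumL)

-- | Hypothetical conversion table: token -> integer code.
type Intern = M.Map String Int

-- | Look a token up, assigning the next free code the first time it is seen.
intern :: Intern -> String -> (Intern, Int)
intern table tok =
    case M.lookup tok table of
        Just code -> (table, code)
        Nothing   ->
            let code = M.size table
            in (M.insert tok code table, code)

-- | Intern a whole token stream, returning the final table and the codes.
internAll :: [String] -> (Intern, [Int])
internAll = mapAccumL intern M.empty
```

Repeated tokens map to the same code, so an n-gram can then be stored as a short list of integers instead of a list of strings.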

> It doesn't even need to be complete morphological
> segmentation, just break on the obvious boundaries in order to get
> the average word size down to something reasonable.

I will do my best to avoid having to do this. My current research target is
German, which is richer in morphology than English, but not as much as Turkic,
Slavic or Uralic languages. In case of a strong inflectional language or even an
agglutinating or polysynthetic one, the need for a smarter morphological
analysis would arise. Unfortunately, this would have to be in the form of
language-specific morphological analysers.

Since this is something I would want to do in general anyway, I might write a
morphological analyser for German that breaks words down to case and tense
markings as well as lemmas, but this isn't the focus of my project, so I'll
first try to do without.

> For regular projects, that integerization would be enough, but for
> your task you'll probably want to spend some time tweaking the
> codes. In particular, you'll probably have enough word types to
> overflow the space of Int32/Word32 or even Int64/Word64.

I will, most likely, because of the long tail of word frequencies, run into
problems with integer space; not only that, but these ultra-rare words aren't
really going to be of much use for me. I'm debating using a trie for common
words while keeping a backlog of rare words in a MongoDB instance. Rare words
could graduate from there if they occur frequently enough, and get accepted into
the trie for easier access. Everything left over in the long tail end would just
be mapped to RARETAG in the n-grams, where TAG refers to the part of speech
tag (since my corpus is pos-tagged.) What number constitutes rare and common
will have to be subject to experimentation.

A bloom filter might guard the trie, too. Since BFs can tell you if a certain
element is certainly *not* in a collection, I could cut down search operations
on the trie itself for all the rare words. AFAICR there's a chapter in RWH on
building Bloom filters.
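
To make the guarding idea concrete, here is a toy sketch, assuming a Bloom filter packed into an Integer and a Data.Set standing in for the trie. All names (Bloom, hashes, insertBloom, mayContain, guardedMember) are hypothetical, and the hash functions are purely illustrative; this is not the RWH implementation.

```haskell
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')
import qualified Data.Set as S

-- | A toy Bloom filter: the number of bits and the bits themselves.
data Bloom = Bloom Int Integer

-- | An empty filter with m bits (m must be positive).
emptyBloom :: Int -> Bloom
emptyBloom m = Bloom m 0

-- | Three cheap polynomial hashes into [0, m); illustrative only.
hashes :: Int -> String -> [Int]
hashes m s = [h 31, h 131, h 1313]
  where
    h k = foldl' (\acc c -> (acc * k + ord c) `mod` m) 7 s `mod` m

insertBloom :: String -> Bloom -> Bloom
insertBloom s (Bloom m bits) = Bloom m (foldl' setBit bits (hashes m s))

-- | False means "certainly absent"; True only means "possibly present".
mayContain :: String -> Bloom -> Bool
mayContain s (Bloom m bits) = all (testBit bits) (hashes m s)

-- | Consult the expensive store only when the filter cannot rule the word out.
guardedMember :: Bloom -> S.Set String -> String -> Bool
guardedMember bloom store w = mayContain w bloom && S.member w store
```

Because the filter never gives a false negative, a negative answer lets the lookup skip the store entirely; a positive answer still has to be confirmed against the store.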

There are 

Re: [Haskell-cafe] SIGSEGV in yieldCapability ()

2011-05-22 Thread Felipe Almeida Lessa
On Sun, May 22, 2011 at 9:31 AM, Johannes Waldmann
waldm...@imn.htwk-leipzig.de wrote:
> Hi. What would be the reason for
>
> Program received signal SIGSEGV, Segmentation fault.

Looks like a bad bug; we shouldn't expect segmentation faults in the
runtime system.  I think you should file a bug report with a test case
on GHC.

Cheers =),

-- 
Felipe.



[Haskell-cafe] ANNOUNCE: wl-pprint-text 1.0.0.0

2011-05-22 Thread Ivan Lazar Miljenovic
And lo, the people of Haskell hath declared that the text [1] library
was to be the One True Way of dealing with textual data (unless, of
course, you're annoyed about it using UTF-16 instead of UTF-8).  And
yet there was much sadness in the land, for there was no
pretty-printing library available to produce said Text values, meaning
that it was difficult for supplicants to produce the formatting their
form of praise required.

A voice then spake: make ye a new pretty-printing library aimed
specifically for Text values.  But thou shall not use the API from
pretty [2], as it is rather limited.  Instead, the design of wl-pprint
[3] looks Real Good!  It has more combinators available, can neatly
deal with wrapping lists, and also has support for producing output
for both humans and the machines that serve us so well!

Thus were mighty energies aimed at producing such a library.  Many
were the hours at which toil was spent on this task, but at long last
the task was completed.  Yet great evil (i.e. programmer laziness)
befell the great task, and it lay hidden for many a moon.  However,
with much effort the project was recovered, and is now available for
all! [4]

Some changes were regrettably required to achieve this task.  The most
important of which was that it was discovered that wl-pprint may not
be as perfect as all that, in that documents of no content were
treated poorly.  However, aspects of the pretty package were able to
redeem themselves by supplying the necessary functionality.  However,
the great task was able to extend upon that which was there before, by
invoking the almighty powers of the Monad.

TL;DR: wl-pprint-text is a pretty-printing package for lazy Text
values based upon the API of wl-pprint.  It however deals with the nil
document better than wl-pprint does, and also has a version where all
the combinators are lifted into a Monad, primarily so that it can be
used within a State Monad.

[1]: http://hackage.haskell.org/package/text
[2]: http://hackage.haskell.org/package/pretty
[3]: http://hackage.haskell.org/package/wl-pprint
[4]: http://hackage.haskell.org/package/wl-pprint-text

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com



Re: [Haskell-cafe] SIGSEGV in yieldCapability ()

2011-05-22 Thread Johannes Waldmann
> I think you should file a bug report with a test case
> on GHC.

I am willing to work on this, but I thought I'd go fishing for some
advice first. My program uses: forkIO, STM, and FFI.

I think that "heap exhausted" sometimes gets reported
as "evacuate: strange closure"
(cf. http://hackage.haskell.org/trac/ghc/ticket/5085 ),
and yieldCapability() might be another instance.

Thanks - J.W.





Re: [Haskell-cafe] Is there an efficient way to generate Euler's totient function for [2, 3..n]?

2011-05-22 Thread Henning Thielemann
Daniel Fischer wrote:

> On Saturday 14 May 2011 19:38:03, KC wrote:
>> Instead of finding the totient of one number, is there a quicker way
>> when processing a sequence?
>
> For some sequences.

You may find alternative ways of computation in the Online Encyclopedia
of Integer Sequences.

http://oeis.org/A10

>> -- From: http://www.haskell.org/haskellwiki/99_questions/Solutions/34
>> totient :: Int -> Int
>> totient n = length [x | x <- [1..n], coprime x n]
>>     where
>>     coprime a b = gcd a b == 1
>
> NEVER do that! It's awfully slow.

It's declarative and may help to verify more efficient implementations.
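
One way to use the declarative definition as a reference is sketched below, assuming a standard sieve that starts from phi(k) = k and multiplies in (1 - 1/p) for every prime p dividing k. The names totientSieve and totientNaive are hypothetical; this is one possible faster method, not necessarily the one anyone in this thread had in mind.

```haskell
import Control.Monad (forM_, when)
import Data.Array.ST (newListArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, (!))

-- | Reference implementation: count the numbers in [1..n] coprime to n.
totientNaive :: Int -> Int
totientNaive n = length [x | x <- [1 .. n], gcd x n == 1]

-- | Totients of 0..n at once (n >= 0). Entry k starts as k; for every prime
-- p, each multiple m of p has its entry multiplied by (1 - 1/p).
totientSieve :: Int -> UArray Int Int
totientSieve n = runSTUArray $ do
    phi <- newListArray (0, n) [0 .. n]
    forM_ [2 .. n] $ \p -> do
        v <- readArray phi p
        -- An entry still equal to its index has no smaller prime factor,
        -- so p is prime.
        when (v == p) $
            forM_ [p, 2 * p .. n] $ \m -> do
                x <- readArray phi m
                writeArray phi m (x - x `div` p)
    return phi
```

Comparing `totientSieve n ! k` with `totientNaive k` for all k up to some moderate n is then a cheap sanity check on the faster version.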




Re: [Haskell-cafe] Is there an efficient way to generate Euler's totient function for [2, 3..n]?

2011-05-22 Thread KC
> It's declarative and may help to verify more efficient implementations.

WOW! Good insight. :)






-- 
--
Regards,
KC



[Haskell-cafe] code review?

2011-05-22 Thread Chris Bolton
Hey guys. Pretty new to Haskell here. I've started off by writing a small
command-line interface to Plurk. I was just wondering if anyone would be
willing to give everything a look-over and lemme know what kinds of things I
should be looking to improve upon, style-wise. Not sure I'm currently doing
things the "Haskell way", so to speak. Thanks a bunch!

https://github.com/saiko-chriskun/hermes

(note: the JSON and Plurkell submodules are also mine.)


Re: [Haskell-cafe] code review?

2011-05-22 Thread Evan Laforge
I'd keep the line length down to 80 chars, and not use ';'s.

All of that fiddling with buffering, printing, reading results could
be more clearly put into a couple of functions.

'if all == False then return False else return True' is a pretty
confusing way to say 'return all'.  In fact, any time you see 'x ==
True' you can just remove the '== True'.  The whole postAll thing
would be clearer as

postAll <- if all && not first then return all else ask "Post to all?"
post <- if postAll then return True else ask "Post to..?"

Anywhere you have 'head x' or 'x!!0' or a 'case length xs of' that's a
chance to crash.  Don't do that.  You can get rid of the heads by
writing (config:configs) and [] cases for postStatus.  Get rid of the
!!0 by making config into a data type, it looks like 'data Config =
Config { configPostTo :: URL?, configUser :: Maybe String, configPass
:: Maybe String }'.  Then 'pass <- maybe (ask "pass?") return
(configPass config)'.  Of course, why make these things optional at
all?
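
Spelled out, the record and the pattern-matching replacement for 'head' might look like the sketch below. It assumes a plain String in place of the URL type, and the helpers firstConfig and orAsk are hypothetical names, not taken from the hermes code.

```haskell
-- | Sketch of the suggested record; String stands in for a real URL type.
data Config = Config
    { configPostTo :: String
    , configUser   :: Maybe String
    , configPass   :: Maybe String
    }

-- | Total replacement for 'head': the empty list is an explicit case.
firstConfig :: [Config] -> Maybe Config
firstConfig (config : _) = Just config
firstConfig []           = Nothing

-- | Use the stored value if there is one, otherwise run the fallback action
-- (for example, a prompt that asks the user).
orAsk :: Maybe String -> IO String -> IO String
orAsk stored fallback = maybe fallback return stored
```

With this, 'pass <- configPass config `orAsk` somePrompt' reads the password from the config when present and only prompts otherwise, where somePrompt is whatever IO action asks the user.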

Looks like the postStatus return value is not used.  It would simplify
it to not return those codes.

I don't know anything about 'postPlurk' but it looks like it could
return a real data type as well.

All this nested if else stuff makes it hard to read, but I think you
can replace the explicit recursion in postStatus with 'mapM_
(postStatus update) configs'.  It looks like that mysterious (all,
first) pair has a different value for the first one, in that case it
would be clearer to either not do that, or do it explicitly like

case configs of
  first : rest -> postFirst update first >> mapM_ (postStatus update) rest
  [] -> complain about no configs

If you pass a single Bool to a function, it means it can have two
behaviours, which is confusing.  If you pass two Bools, then it can
have 4, which is even more confusing :)  I myself use if/else only
rarely.

Looking a little at Plurkell, 'return =<<' is redundant.  And I'm sure
there's a urlEncode function around so you don't have to build URLs
yourself?

I don't understand the stuff with the words in the case, but it looks
like a confusing way to say 'if word `elem` specialWords'.  There's
also a Set type for that kind of thing.  And that regex stuff is...
quoting?  Isn't there a library function for that too?  It's the sort
of thing a URL library would have.

If not, it's something like 'replace [("|", "%7C"), ("/", "%2F"), (" ",
"%20")]', right?  I'm sure there's a replace function like that
floating around somewhere, if not, you can write your own.
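
A hand-written version of such a replace function could look like the following sketch. It assumes single-character keys, and the names replaceChars and quoteForUrl are hypothetical; note that it does not escape '%' itself or any character outside the table, which is one reason a URL library's encoder would be the safer choice.

```haskell
import Data.Maybe (fromMaybe)

-- | Replace each character that has an entry in the table by its
-- replacement string; all other characters pass through unchanged.
replaceChars :: [(Char, String)] -> String -> String
replaceChars table = concatMap (\c -> fromMaybe [c] (lookup c table))

-- | The three substitutions mentioned above.
quoteForUrl :: String -> String
quoteForUrl = replaceChars [('|', "%7C"), ('/', "%2F"), (' ', "%20")]
```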

And for JSON... wasn't someone just complaining about there being too
many JSON libraries on hackage?  Unless you want to do it yourself for
fun (it's a good way to learn parsing), why not just download one of
those?

That's enough for now, have fun :)



Re: [Haskell-cafe] code review?

2011-05-22 Thread Alexander Solla
Good points.  Here are a few more.  Use more explicit types.  You can cut
down on if-then-else's by using pattern matching and guards.  For example,
the first if-then-else in postStatus can be turned into:

import System.IO

newtype PostToAllNetworks = PostToAll Bool
newtype FirstPost = FirstPost Bool

-- Ask only when posting to all networks is enabled and this is the first post.
postToAllNetworks :: PostToAllNetworks -> FirstPost -> IO Bool
postToAllNetworks (PostToAll True) (FirstPost True) = do
    putStr "Post to all networks? [y/n] "
    hSetBuffering stdin NoBuffering
    hFlush stdout
    postAll <- getChar
    hSetBuffering stdin LineBuffering
    return (postAll == 'y')
postToAllNetworks _ _ = return False

Notice I'm not just playing golf here.  This is easier to read.  Also, keep
in mind that "post" is a noun and a verb.  It is very functional to write
functions whose names are nouns.  It might be nice if Haskell had some
syntax like Ruby does, so we could ask postToAllNetworks?  But since we
don't, you should be open to the possibility that a name can be either.

