[Computer-go] Datasets for CNN training?

2015-01-11 Thread Hugh Perkins
Thinking about datasets for CNN training, since I currently lack one
:-P  So far I've been using MNIST, partly because MNIST results are
widely known: if I train with a couple of layers and only get 12%
accuracy, I obviously know I have to fix something :-P

But now my network consistently gets up into the 97-98% range on MNIST,
even with just a layer or two, and the speed is ok-ish, so I probably want
to start running training against 19x19 boards instead of 28x28.  The
optimization is different: on my laptop, an OpenCL workgroup can hold a
19x19 board, with one thread per intersection, but 28x28 threads would
exceed the workgroup size.  Unless I loop, or break the board into two
workgroups, or do something else equally buggy, slow, and
high-maintenance :-P
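
(For anyone who wants to check their own device: a quick pyopencl snippet
for querying that limit.  Just an illustrative sketch; note also that the
per-kernel limit can be lower than the device maximum.)

    import pyopencl as cl

    ctx = cl.create_some_context()
    dev = ctx.devices[0]
    max_wg = dev.max_work_group_size
    print('max workgroup size: %d' % max_wg)
    print('19x19 fits in one workgroup: %s' % (19 * 19 <= max_wg))  # 361 work-items
    print('28x28 fits in one workgroup: %s' % (28 * 28 <= max_wg))  # 784 work-items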

So, I could crop the mnist boards down to 19x19, but whoever heard of
training on 19x19 mnist boards?

So, possibly time to start hitting actual Go boards.  Many other
datasets are available in a standardized, generic format, ready to feed
into any machine learning algorithm: for example, those provided at the
libsvm website, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ ,
or MNIST, yann.lecun.com/exdb/mnist/ .  The Go datasets are not (yet)
available in any kind of standard format, so I'm thinking maybe it could
be useful to provide one?  But there are three challenges:

1. What data to store?  Clark and Storkey planes?  Raw boards?  Maddison
et al. planes?  Something else?  For now my answer is: something
corresponding to an actual existing paper, and Clark and Storkey's
network has the advantage of costing less than 2000 USD to train, so
that's my answer to 'what data to store?'
2. Copyright.  GoGoD is apparently (a) copyrighted as a collection, and
(b) compiled by hand, by painstakingly going through each game and
entering it into the computer one move at a time.  So it's probably not
really likely that one could publish it, even preprocessed, as a standard
dataset?  However, the good news is that the KGS dataset seems publicly
available, and big, so maybe just use that?
3. Size.  This is where I don't have an answer yet.
- 8 million states, where each state is 8 planes * 361 locations, comes to
about 20GB :-P
- the raw sgfs only take about 3KB per game, for a total of about 80MB,
but they need a lot of preprocessing, and if one were to feed each game
through in order, that might not be the best sequence for effective
learning?
- current idea: encode one column through the planes, i.e. all 8 plane
values for one intersection, as a single byte?  Clark and Storkey only
have 8 planes, so this should be easy enough :-) (rough packing sketch
after this list)
- which would be 2.6GB instead
- but still kind of large to put on my web hosting :-P
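
(To make the one-byte-per-intersection idea concrete, here's a rough numpy
sketch.  The names are made up, and it assumes the 8 planes hold binary
0/1 values.)

    import numpy as np

    # weights for bits 0..7, one bit per plane
    WEIGHTS = np.array([1, 2, 4, 8, 16, 32, 64, 128], dtype=np.uint8).reshape(8, 1, 1)

    def pack_planes(planes):
        # planes: uint8 array, shape (8, 19, 19), values 0 or 1
        # returns: uint8 array, shape (19, 19), one byte per intersection
        return (planes * WEIGHTS).sum(axis=0).astype(np.uint8)

    def unpack_planes(packed):
        # packed: uint8 array, shape (19, 19)
        # returns: uint8 array, shape (8, 19, 19), values 0 or 1
        bits = np.arange(8, dtype=np.uint8).reshape(8, 1, 1)
        return ((packed[np.newaxis, :, :] >> bits) & 1).astype(np.uint8)

So each state becomes 361 bytes instead of 8 * 361, which is roughly where
the 2.6GB figure comes from.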

I suppose a compromise might be needed, which would also somewhat solve
problem number 1: just provide a tool, e.g. in Python, or C, or Cython,
which takes the KGS downloads, and possibly the GoGoD download, and
transforms them into a 2.6GB dataset, ready for training, and possibly
pre-shuffled?

But this would be quite non-standard, although it's not unheard of: e.g.
for ImageNet there is a devkit,
http://image-net.org/challenges/LSVRC/2011/index#devkit

Maybe I will create a github project, something like
'kgs-dataset-preprocessor'?  It could work something like:

   python kgs-dataset-preprocessor.py [targetdirectory]

Results:
- the datasets are downloaded from http://u-go.net/gamerecords/
- decompressed
- loaded one at a time, and processed into a ~2.6GB datafile, in
sequence (clients can handle shuffling themselves, I suppose?)

Thoughts?

Hugh
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Datasets for CNN training?

2015-01-11 Thread Hugh Perkins
Made a start here: https://github.com/hughperkins/kgsgo-dataset-preprocessor
- downloads the html page with the list of download zip urls from kgs
- downloads the zip files listed on that page
- unzips the zip files
- loads each sgf file in turn
- uses gomill to parse the sgf file, and checks that it is 19x19 with no
handicap (rough sketch below)
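
(Roughly the kind of gomill-based filter involved; a simplified sketch, not
the actual project code.)

    from gomill import sgf

    def wanted(sgf_bytes):
        # keep only 19x19 games with no handicap stones
        game = sgf.Sgf_game.from_string(sgf_bytes)
        if game.get_size() != 19:
            return False
        if game.get_handicap() is not None:
            return False
        return True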

... and on the other hand, created some classes to handle the mechanics
of a Go game:
- GoBoard: represents a go board, can apply moves, handles captures,
detects ko, and contains GoStrings
- GoString: a string of contiguous stones of the same color; also holds
a full list of all its liberties
- Bag2d: a double-indexed bag of 2d locations:
   - given any location, knows whether it is in the bag or not, in O(1)
   - can iterate the locations, O(1) per location iterated
   - can erase a location in O(1)

... so now just need to link these together, and pump out the binary data file
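
(For the curious, here's a sketch of how a double-indexed bag like Bag2d
can get those O(1) guarantees; not the actual class, just the usual
dense-list-plus-index-map trick, with swap-and-pop for erase.)

    class Bag2dSketch(object):
        def __init__(self):
            self.locations = []   # dense list of (row, col) tuples
            self.index = {}       # (row, col) -> position in self.locations

        def add(self, loc):
            if loc not in self.index:
                self.index[loc] = len(self.locations)
                self.locations.append(loc)

        def __contains__(self, loc):   # O(1) membership test
            return loc in self.index

        def __iter__(self):            # O(1) per location iterated
            return iter(self.locations)

        def erase(self, loc):          # O(1): swap with last element, pop
            i = self.index.pop(loc)
            last = self.locations.pop()
            if last != loc:
                self.locations[i] = last
                self.index[last] = i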


Re: [Computer-go] Datasets for CNN training?

2015-01-11 Thread David Fotland
Why don’t you make a dataset of the raw board positions, along with code to 
convert to Clark and Storkey planes?  The data will be smaller, people can 
verify against Clark and Storkey, and they have the data to make their own 
choices about preprocessing for network inputs.

David


Re: [Computer-go] Datasets for CNN training?

2015-01-11 Thread Hugh Perkins
> Why don’t you make a dataset of the raw board positions, along with code to
> convert to Clark and Storkey planes?  The data will be smaller, people can
> verify against Clark and Storkey, and they have the data to make their own
> choices about preprocessing for network inputs.

Well, a lot of the data is dynamic, e.g. 'moves since last move', and
cannot be obtained by looking at a single, isolated position.  The most
compact way of representing the required information is, in fact, the sgf
files...

What I'm thinking of doing is making the layers that get created into
options to the script, like: I want 3 layers for liberties, regardless of
which side, and one layer for illegal moves, and ... etc, something like
that? (rough sketch of the idea below)
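
(Something along these lines; all the names and the board/liberties
representation here are made up just for illustration.)

    import numpy as np

    def build_planes(board, liberties, options):
        # board: (19, 19) ints, 0 = empty, 1 = player to move, 2 = opponent
        # liberties: (19, 19) ints, liberty count of the string at each point
        # options: list of plane names chosen by the user
        planes = []
        for opt in options:
            if opt.startswith('liberties=='):   # e.g. 'liberties==1', either side
                n = int(opt.split('==')[1])
                planes.append(((board != 0) & (liberties == n)).astype(np.uint8))
            elif opt == 'empty':
                planes.append((board == 0).astype(np.uint8))
            # ... further plane types: ko, illegal moves, 'moves since', etc.
        return np.array(planes)   # shape (num_planes, 19, 19)

    # e.g. build_planes(board, liberties,
    #                   ['liberties==1', 'liberties==2', 'liberties==3', 'empty'])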

As far as downloading the data, all the sgfs, the script already does
that.  Actually, the script is pretty much finished as far as the Clark
and Storkey layers go; I just need to debug it a bit...