[Computer-go] Datasets for CNN training?
Thinking about datasets for CNN training, of which I currently lack one :-P Hence I've been using MNIST, partly because MNIST results are widely known: if I train with a couple of layers and get 12% accuracy, obviously I know I have to fix something :-P But now my network consistently gets up into the 97-98% range on MNIST, even with just a layer or two, and speed is ok-ish, so I probably want to start running training against 19x19 boards instead of 28x28.

The optimization is different. On my laptop, an OpenCL workgroup can hold a 19x19 board, with one thread per intersection, but 28x28 threads would exceed the workgroup size, unless I loop, or break into two workgroups, or something else equally buggy, slow, and high-maintenance :-P (See the pyopencl sketch at the end of this message.) So I could crop the MNIST boards down to 19x19, but whoever heard of training on 19x19 MNIST boards? So, possibly time to start hitting actual Go boards.

Many other datasets are available in a standardized, generic format, ready to feed into any machine learning algorithm: for example, those provided at the libsvm website http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ , or MNIST, yann.lecun.com/exdb/mnist/ . The Go datasets are not (yet) available in any kind of standard format, so I'm thinking: maybe it would be useful to do that? But there are three challenges:

1. What data to store? Clark and Storkey planes? Raw boards? Maddison et al. planes? Something else? For now, my answer is: something corresponding to an actual existing paper, and Clark and Storkey's network has the advantage of costing less than USD 2000 to train. So that's my answer to 'what data to store?'

2. Copyright. GoGoD is apparently a. copyrighted as a collection, and b. compiled by hand, as a result of painstakingly going through each game, move by move, and entering it into the computer one move at a time. It's probably not likely that one could publish this, even preprocessed, as a standard dataset? However, the good news is that the KGS dataset seems publicly available, and big, so maybe just use that?

3. Size. This is where I don't have an answer yet.
- 8 million states, where each state is 8 planes * 361 locations, comes to about 20GB :-P
- The raw sgfs only take about 3KB per game, for a total of about 80MB, but they need a lot of preprocessing, and if one were to feed each game through in order, that might not be the best sequence for effective learning?
- Current idea: encode one column through the planes as a single byte (see the packing sketch at the end of this message). Clark and Storkey only have 8 planes, so this should be easy enough :-) That would be about 2.6GB instead. Still kind of large to put on my web hosting :-P

I suppose a compromise could be needed, which would also somewhat solve problem number 1: just provide a tool, e.g. in Python, or C, or Cython, which takes the KGS downloads, and possibly the GoGoD download, and transforms them into a 2.6GB dataset, ready for training, and possibly pre-shuffled? This would be quite non-standard, although not unheard of; e.g. for ImageNet there is a devkit http://image-net.org/challenges/LSVRC/2011/index#devkit

Maybe I will create a github project, like 'kgs-dataset-preprocessor'? It could work something like:

python kgs-dataset-preprocessor.py [targetdirectory]

Results:
- the datasets are downloaded from http://u-go.net/gamerecords/
- decompressed
- loaded one at a time, and processed into a 2.6GB datafile, in sequence (clients can handle shuffling themselves, I suppose?)

Thoughts?
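For reference, checking that workgroup limit with pyopencl looks something like this (a minimal sketch, assuming pyopencl is installed; the exact limit varies per device):

    # query each OpenCL device's maximum workgroup size, and check
    # whether one thread per intersection fits for each board size
    import pyopencl as cl

    for platform in cl.get_platforms():
        for device in platform.get_devices():
            max_wg = device.max_work_group_size
            print(device.name, 'max workgroup size:', max_wg)
            print('  19x19 needs', 19 * 19, 'threads; fits:', 19 * 19 <= max_wg)
            print('  28x28 needs', 28 * 28, 'threads; fits:', 28 * 28 <= max_wg)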
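And the packing sketch: with numpy, one byte per intersection falls out of a broadcast. The layout here is my assumption: planes is a 0/1 uint8 array of shape (8, 19, 19), with bit i of each output byte holding plane i.

    import numpy as np

    def pack_planes(planes):
        # (8, 19, 19) 0/1 planes -> (19, 19) bytes, bit i = plane i
        assert planes.shape[0] == 8
        weights = (1 << np.arange(8, dtype=np.uint8)).reshape(8, 1, 1)
        return (planes.astype(np.uint8) * weights).sum(axis=0).astype(np.uint8)

    def unpack_planes(packed):
        # (19, 19) bytes -> (8, 19, 19) 0/1 planes
        bits = np.arange(8, dtype=np.uint8).reshape(8, 1, 1)
        return (packed[np.newaxis, :, :] >> bits) & 1

8 million states * 361 bytes comes to roughly 2.7GiB, which is where the 2.6GB figure above comes from.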
Hugh
Re: [Computer-go] Datasets for CNN training?
Made a start here: https://github.com/hughperkins/kgsgo-dataset-preprocessor

So far it:
- downloads the html page with the list of download zip urls from KGS
- downloads the zip files, based on the html page
- unzips the zip files
- loads each sgf file in turn
- uses gomill to parse the sgf file, and checks it is 19x19, with no handicap

... and on the other hand, I created some classes to handle the mechanics of a Go game:
- GoBoard: represents a go board; can apply moves, handles captures, detects ko, and contains GoStrings
- GoString: a string of contiguous stones of the same color; also holds a full list of all its liberties
- Bag2d: a double-indexed bag of 2d locations (sketched below):
  - given any location, knows whether it is in the bag or not, in O(1)
  - can iterate the locations, O(1) per location iterated
  - can erase a location in O(1)

... so now I just need to link these together, and pump out the binary data file.
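The Bag2d trick, for concreteness, is just a dense list plus a location-to-index dict, erasing in O(1) by swapping with the last element. A minimal sketch (names and details are illustrative; the real code lives in the repository above):

    class Bag2d:
        def __init__(self):
            self.locations = []   # dense list of (row, col) tuples
            self.index = {}       # (row, col) -> position in self.locations

        def add(self, loc):
            if loc not in self.index:
                self.index[loc] = len(self.locations)
                self.locations.append(loc)

        def __contains__(self, loc):
            # O(1) membership test
            return loc in self.index

        def erase(self, loc):
            # O(1): move the last element into the erased slot
            i = self.index.pop(loc)
            last = self.locations.pop()
            if i < len(self.locations):
                self.locations[i] = last
                self.index[last] = i

        def __iter__(self):
            # O(1) per location iterated
            return iter(self.locations)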
Re: [Computer-go] Datasets for CNN training?
Why don’t you make a dataset of the raw board positions, along with code to convert to Clark and Storkey planes? The data will be smaller, people can verify against Clark and Storkey, and they have the data to make their own choices about preprocessing for network inputs.

David
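To make the suggestion concrete, the conversion code could look roughly like this. A hedged sketch: the raw encoding of 0 empty / 1 black / 2 white is an assumption, and these are only liberty-count planes, one part of the Clark and Storkey feature set, not all of it.

    import numpy as np

    def string_liberties(board, start):
        # flood-fill the string containing `start` and count its liberties;
        # recomputed per stone for clarity -- a real converter would cache
        colour = board[start]
        stack, string, liberties = [start], {start}, set()
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= nr < 19 and 0 <= nc < 19:
                    if board[nr, nc] == 0:
                        liberties.add((nr, nc))
                    elif board[nr, nc] == colour and (nr, nc) not in string:
                        string.add((nr, nc))
                        stack.append((nr, nc))
        return len(liberties)

    def liberty_planes(board, to_move):
        # planes 0-2: mover's stones with 1, 2, >=3 liberties; 3-5: opponent's
        planes = np.zeros((6, 19, 19), dtype=np.uint8)
        for r in range(19):
            for c in range(19):
                if board[r, c] == 0:
                    continue
                libs = min(string_liberties(board, (r, c)), 3)
                offset = 0 if board[r, c] == to_move else 3
                planes[offset + libs - 1, r, c] = 1
        return planes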
Re: [Computer-go] Datasets for CNN training?
Why don’t you make a dataset of the raw board positions, along with code to convert to Clark and Storkey planes?

Well, a lot of the data is dynamic, e.g. 'moves since last move', and cannot be obtained by looking at a single, isolated position. The most compact way of representing the required information is, in fact, the sgf files themselves... What I'm thinking of doing is making the layers that get created configurable, as options to the script: something like, I want 3 layers for liberties, no matter which side, and one layer for illegal moves, and so on (see the sketch below). As for downloading the data, all the sgfs: the script already does that. Actually, the script is pretty much finished, as far as the Clark and Storkey layers are concerned; it just needs a bit of debugging...
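A rough sketch of that option interface (flag names invented for illustration; the real script may end up different):

    import argparse

    # hypothetical command-line interface for choosing which planes
    # the preprocessor emits
    parser = argparse.ArgumentParser(prog='kgs-dataset-preprocessor.py')
    parser.add_argument('targetdirectory',
                        help='where to download games and write the datafile')
    parser.add_argument('--liberty-planes', type=int, default=3,
                        help='number of liberty-count planes')
    parser.add_argument('--per-side', action='store_true',
                        help='separate liberty planes per colour, instead of colour-blind')
    parser.add_argument('--illegal-moves-plane', action='store_true',
                        help='add one plane marking illegal moves')
    args = parser.parse_args()

So e.g.: python kgs-dataset-preprocessor.py ~/data --liberty-planes 3 --illegal-moves-plane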