Re: [Computer-go] Standard Computer Go Datasets - Proposal
To answer the original question: yes, the curation of a dataset like this would be hugely beneficial to the community. Look at what ImageNet has done for computer vision. In fact, it might be good to emulate ImageNet further and pre-split the dataset into a publicly-available training set, and a hidden testing set, for truly objective comparisons between move-prediction algorithms.

If you undertake this, many thanks in advance!

On Fri, Nov 13, 2015 at 1:20 PM, Dave Dyer wrote:
> I was recently working on assigning final scores to completed games, using
> the large data set from Badukmovies.com.
>
> My observation is that the size of the data set (50,000 games) is not
> large enough to get good coverage of unusual situations occurring in real
> games.
>
> There's a definite need for a curated collection of atypical but
> interesting games, probably manipulated to explore the boundaries
> between interesting and normal.

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
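A minimal sketch of the pre-split idea above, assuming each game is available as SGF text. The name assign_split and the 10% test fraction are illustrative assumptions, not something proposed in the thread. Hashing the game content makes the assignment deterministic, so a game lands in the same subset no matter how the files are ordered or named:

```python
import hashlib

def assign_split(sgf_text, test_fraction=0.1):
    """Deterministically assign a game to 'train' or 'test' from its content.

    Hypothetical helper: the first 8 bytes of the SHA-256 digest are read as
    an integer in [0, 2**64) and compared against the requested fraction.
    """
    digest = hashlib.sha256(sgf_text.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return "test" if bucket < test_fraction * 2**64 else "train"

if __name__ == "__main__":
    game = "(;GM[1]SZ[19]KM[6.5];B[pd];W[dp])"
    print(assign_split(game))
```

The maintainers would then publish only the games labelled "train" and keep the "test" games private for evaluation.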
Re: [Computer-go] Standard Computer Go Datasets - Proposal
I was recently working on assigning final scores to completed games, using the large data set from Badukmovies.com.

My observation is that the size of the data set (50,000 games) is not large enough to get good coverage of unusual situations occurring in real games.

There's a definite need for a curated collection of atypical but interesting games, probably manipulated to explore the boundaries between interesting and normal.
Re: [Computer-go] Standard Computer Go Datasets - Proposal
At least in the past, some DCNNs made use of the players' ranks, so it would be best to keep them.

On 11/13/2015 10:27 AM, Josef Moudrik wrote:
> On Fri, Nov 13, 2015 at 11:16 AM Erik van der Werf wrote:
>> On Fri, Nov 13, 2015 at 10:46 AM, Darren Cook wrote:
>>> The advantages of storing games:
>>> * accountability/traceability
>>> * for programs who want to learn sequences of moves.
>>
>> Another advantage of storing games is that it is much more efficient; you
>> only have to encode one move per position.
>>
>> Erik
>
> Yes, I think that having full games would be much more useful. The
> anonymization I had in mind would include hiding information not
> important for computer processing such as file-names, player names, dates,
> ranks, comments (given that the dataset would ensure consistent "balanced"
> distribution). Like this, the database would have no (or much less) use for
> human study.
Re: [Computer-go] Standard Computer Go Datasets - Proposal
I think if you start calculating the Zobrist hashes and scraping features yourself, you will have a never-ending variety of datasets. I would prefer datasets of whole, high-quality games without SGF errors, perhaps cleaned of identifying information. Parsing an SGF is already trivial. I personally divide them by:

- Handicap used or not
- Komi normal (5.5 to 7.5) or not; this disqualifies some older games
- Rules used
- Board size

Following the idea of having more information instead of very specific features already extracted, it would be interesting to also have the playing times, although I don't know where you'd get that from.

You'd be an angel if you could provide a large dataset of matches with Chinese rules, especially in board sizes other than 19x19. It would of course also have to be completely free for any use.

I personally only use the KGS 6d+ games and a collection of 70k pro games whose origin I don't know. GoGoD is proprietary. :)

Gonçalo F.

On 11/13/2015 08:39 AM, Josef Moudrik wrote:
> Hello List,
>
> There has been some debate in science about making the research more
> reproducible and open. Recently, I have been thinking about making a
> standard public fixed dataset of Go games, mainly to ease comparison of
> different methods, to make results more reproducible and maybe free the
> authors of the burden of composing a dataset. I think that the current
> practice can be improved a lot.
>
> Since the success of this endeavor crucially depends on how many authors
> use the dataset, I would like to ask You (potential authors) a few
> questions:
>
> 1) Would this be welcomed and used? Would You personally use it? (Am I not
> reinventing the wheel?)
>
> 2) What parameters should the dataset have? The number of dataset variants
> (if any) should be in my opinion kept at bare minimum to reduce
> "fragmentation".
>
> 2a) Size: My current view is that at least 2 sizes are necessary: small
> (1000-2000 games?) and large dataset (5-6 games).
>
> 2b) Strength & year span: Currently I am thinking about including modern
> professional games only (1970-2015)
>
> 3) Do you have any other comments, requirements for the dataset and ideas?
>
> Thanks for Your attention,
> Kind regards
> Josef Moudrik
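The four criteria Gonçalo lists (handicap, komi, rules, board size) map onto the standard SGF root properties HA, KM, RU and SZ. The following is only a rough sketch of such a filter: the function name classify and the returned keys are made up for illustration, and the regular expression is not a full SGF parser (it looks only at the text between the first and second ";" and would be confused by a ";" inside a property value).

```python
import re

# Matches the four root properties of interest and captures their values.
ROOT_PROP = re.compile(r"\b(HA|KM|RU|SZ)\[([^\]]*)\]")

def classify(sgf_text):
    """Return the four grouping criteria for one game (illustrative only)."""
    parts = sgf_text.split(";", 2)
    root = parts[1] if len(parts) > 1 else ""  # root node text
    props = dict(ROOT_PROP.findall(root))
    try:
        komi = float(props["KM"])
    except (KeyError, ValueError):
        komi = None  # komi missing or unparsable
    return {
        "handicap": props.get("HA", "0") not in ("", "0"),
        "normal_komi": komi is not None and 5.5 <= komi <= 7.5,
        "rules": props.get("RU", "").strip().lower() or None,
        "board_size": props.get("SZ", "19"),  # SGF default for Go is 19
    }

if __name__ == "__main__":
    print(classify("(;GM[1]FF[4]SZ[19]KM[6.5]RU[Chinese];B[pd];W[dp])"))
```

Games could then be grouped or rejected by these four values before any feature extraction takes place.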
Re: [Computer-go] Standard Computer Go Datasets - Proposal
On Fri, Nov 13, 2015 at 11:16 AM Erik van der Werf wrote:
> On Fri, Nov 13, 2015 at 10:46 AM, Darren Cook wrote:
>> The advantages of storing games:
>> * accountability/traceability
>> * for programs who want to learn sequences of moves.
>
> Another advantage of storing games is that it is much more efficient; you
> only have to encode one move per position.
>
> Erik

Yes, I think that having full games would be much more useful. The anonymization I had in mind would include hiding information not important for computer processing, such as file names, player names, dates, ranks and comments (given that the dataset would ensure a consistent, "balanced" distribution). That way, the database would have no (or much less) use for human study.
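As a rough illustration of the anonymization described above, the sketch below strips identifying SGF properties (player names, ranks, teams, dates, event and place information, comments) from a game record. The property list and the name anonymize are assumptions made for this example; the thread does not fix which fields would be removed, and file names would have to be handled separately. The regular expression works on raw text and is not a real SGF parser.

```python
import re

# Standard SGF properties that can identify a game or its players
# (an assumed selection, not a list agreed on in the thread).
IDENTIFYING = ("PB", "PW", "BR", "WR", "BT", "WT", "DT", "EV", "RO",
               "PC", "GN", "GC", "US", "SO", "AN", "CP", "C")

# A property identifier not preceded by another capital letter, followed by
# one bracketed value in which "\]" counts as an escaped bracket.
PATTERN = re.compile(
    r"(?<![A-Z])(?:%s)\[(?:\\.|[^\]\\])*\]\s*" % "|".join(IDENTIFYING),
    re.DOTALL,
)

def anonymize(sgf_text):
    """Remove identifying properties, leaving moves and settings intact."""
    return PATTERN.sub("", sgf_text)

if __name__ == "__main__":
    game = "(;GM[1]SZ[19]PB[Alice]BR[9p]PW[Bob]WR[8p]KM[6.5];B[pd];W[dp])"
    print(anonymize(game))
```

If ranks are to be kept, as suggested elsewhere in the thread, BR and WR can simply be dropped from the tuple.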
Re: [Computer-go] Standard Computer Go Datasets - Proposal
On Fri, Nov 13, 2015 at 10:46 AM, Darren Cook wrote:
> The advantages of storing games:
> * accountability/traceability
> * for programs who want to learn sequences of moves.

Another advantage of storing games is that it is much more efficient; you only have to encode one move per position.

Erik
Re: [Computer-go] Standard Computer Go Datasets - Proposal
Hi!

On Fri, Nov 13, 2015 at 09:46:54AM +, Darren Cook wrote:
> (I did wonder about storing player ranks, e.g. if a given position has a
> move chosen by only a single 9p, and you can then extract each follow-up
> position, you could extract a game. But, IMHO, you cannot regenerate any
> particular game collection this way. If it is a concern, it can be
> solved by only using a random 80% of moves from games.)

Dropping player names and some positions is a nice idea, especially, from a moral standpoint, if the collection includes a prominent notice encouraging voluntary donations by the users to the source collection, e.g. GoGoD.

(A technical note: you want info about the last and second-to-last move in the position, as that's a feature that's often used in patterns. Plus, bridging over just 1-3 missing moves seems pretty easy to do by brute force. A better scheme might be to drop, say, a block of 20 moves starting at a random point between moves 40 and 80.)

I think a good question is what other uses, besides learning move patterns, people envision.

--
Petr Baudis
If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
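The block-dropping scheme suggested above could look roughly like this. It is a sketch under assumptions: records is taken to be the per-move list of position records for one game (index 0 being the first move), and the function name drop_block and its parameters are invented for illustration.

```python
import random

def drop_block(records, block_len=20, start_min=40, start_max=80, rng=random):
    """Remove one contiguous block of records from a game.

    The block starts at a random index in [start_min, start_max] and is
    block_len records long; games shorter than the start index are
    returned unchanged.
    """
    start = rng.randint(start_min, start_max)  # both ends inclusive
    kept = records[:start] + records[start + block_len:]
    return kept, start

if __name__ == "__main__":
    positions = list(range(200))  # stand-in for 200 position records
    kept, start = drop_block(positions, rng=random.Random(2015))
    print(len(kept), start)
```

Unlike dropping a random 80% of single moves, the gap here is too long to bridge by brute force, while the surviving records still carry their last and second-to-last move.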
Re: [Computer-go] Standard Computer Go Datasets - Proposal
Hello,

On Fri, Nov 13, 2015 at 10:13 AM wrote:
> I would only use it if it is licensed for commercial use.

Yes, I would like to license this as such, please see below.

On Fri, Nov 13, 2015 at 10:23 AM Petr Baudis wrote:
> I think the current de facto standard dataset is GoGoD (some year, not
> quite fixed). So I think it's useful to differentiate your proposal
> against this dataset - what are the current problems and what will be
> the advantage?

Yes, I know GoGoD is used frequently, but I think that the lack of a "precise" specification is the problem. There are many choices an author has to make when using the GoGoD database: year of release, year span, handicap games?, amateur/professional? (how to tell? pro rank is given as d, not p). A related issue is that some of the games (if I remember my experience correctly) cannot be parsed by some libraries, in which case they are usually skipped. All these are branching points that make "precise" replication of results hard.

> One advantage would be of course if the dataset is freely available.
> But it's not clear how to achieve that, i.e. where to get a large
> professional game collection without copyright protection.

I consider this "negotiation" the hardest work I will have to do, but before I start, I want to find out whether the dataset would even be used.

From the point of view of copyright law, I believe that what is protected is the "collection of games" and "additional materials" (comments, etc.), not the actual individual games themselves (which, as a record of a historical event, afaik cannot be copyrighted). The "collection of games" and "additional materials" rights of the current collection owners could be protected by anonymization of the records and mixing of different databases, if the current owners agree.

From the licensing point of view, again given that the owners agree, I would like to release the dataset under something like a free-for-all-purposes-with-attribution license. I have yet to research this.

> What's the usecase for a small dataset?

I had prototype testing in mind, so that authors can say "our method is slow, so we only tested on the SmallGoDataset" instead of "we randomly took 1000 games from the BigGoDataset", but I assume there would be other use cases as well. Anyway, I think the big and small datasets would not cause much use-fragmentation, because the use cases for big vs small would be different.

But maybe I am overthinking things and this would not be used much...

Regards,
Josef
Re: [Computer-go] Standard Computer Go Datasets - Proposal
> standard public fixed dataset of Go games, mainly to ease comparison of
> different methods, to make results more reproducible and maybe free the
> authors of the burden of composing a dataset.

Maybe the first question should be whether people want a database of *positions* or *games*.

I imagine a position database to be a set of board descriptions, with each pro move marked on it. Ideally each move would say not just the number of times it was chosen, but break it down by rank of player. Each position would have a Zobrist hash calculated in all 8 symmetric orientations, and the lowest chosen. This handles rotations and duplicates. If there is a ko-illegal point on the board, it needs to be stored and also be part of the Zobrist hash.

A database of positions has some advantages:
* No licensing issues (*)
* Rotational duplicates already removed
* Ready-to-go with the information (most) programs want to learn.

The advantages of storing games:
* accountability/traceability
* for programs who want to learn sequences of moves.

Darren

*: At least that was my conclusion when I looked into this before. Game collections can be copyrighted; moves cannot. A database of moves can be freely distributed, even if it was generated from copyrighted game collections, as long as there exists no way to regenerate the game collection from it. Text corpora (used in machine translation studies, for instance) follow the same idea: if you split the corpora into sentences, then shuffle them up randomly, you can distribute the set of sentences.

(I did wonder about storing player ranks, e.g. if a given position has a move chosen by only a single 9p, and you can then extract each follow-up position, you could extract a game. But, IMHO, you cannot regenerate any particular game collection this way. If it is a concern, it can be solved by only using a random 80% of moves from games.)
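The canonical-hash idea above can be sketched as follows: compute a Zobrist hash for each of the 8 rotations and reflections of the board and keep the smallest, so that symmetric copies of a position collapse to one key. Everything concrete here is an assumption made for illustration (the fixed seed, the key tables, the names symmetries and canonical_hash, the dict-based board); side to move is not hashed because the message does not mention it.

```python
import random

SIZE = 19
BLACK, WHITE = 1, 2

_rng = random.Random(2015)  # fixed seed so keys are reproducible
# One random 64-bit key per (point, colour); index 0 (empty) is unused.
STONE_KEYS = [[_rng.getrandbits(64) for _ in range(3)]
              for _ in range(SIZE * SIZE)]
# One extra key per point for marking it as the ko-illegal point.
KO_KEYS = [_rng.getrandbits(64) for _ in range(SIZE * SIZE)]

def symmetries(x, y, n=SIZE):
    """The 8 images of (x, y) under rotations and reflections of the board."""
    m = n - 1
    return [(x, y), (m - x, y), (x, m - y), (m - x, m - y),
            (y, x), (m - y, x), (y, m - x), (m - y, m - x)]

def canonical_hash(stones, ko=None):
    """stones maps (x, y) to BLACK or WHITE; ko is an optional (x, y).

    Returns the lowest of the 8 symmetric Zobrist hashes.
    """
    hashes = [0] * 8
    for (x, y), colour in stones.items():
        for i, (sx, sy) in enumerate(symmetries(x, y)):
            hashes[i] ^= STONE_KEYS[sy * SIZE + sx][colour]
    if ko is not None:
        for i, (sx, sy) in enumerate(symmetries(*ko)):
            hashes[i] ^= KO_KEYS[sy * SIZE + sx]
    return min(hashes)

if __name__ == "__main__":
    a = {(3, 3): BLACK, (15, 3): WHITE}
    b = {(15, 3): BLACK, (15, 15): WHITE}  # a rotated by 90 degrees
    print(canonical_hash(a) == canonical_hash(b))
```

Because the 8 transforms form a closed group, any rotated or mirrored copy of a position produces the same set of 8 hashes and therefore the same minimum.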
Re: [Computer-go] Standard Computer Go Datasets - Proposal
Hi!

On Fri, Nov 13, 2015 at 08:39:20AM +, Josef Moudrik wrote:
> There has been some debate in science about making the research more
> reproducible and open. Recently, I have been thinking about making a
> standard public fixed dataset of Go games, mainly to ease comparison of
> different methods, to make results more reproducible and maybe free the
> authors of the burden of composing a dataset. I think that the current
> practice can be improved a lot.

I think the current de facto standard dataset is GoGoD (some year, not quite fixed). So I think it's useful to differentiate your proposal against this dataset - what are the current problems and what will be the advantage?

One advantage would be of course if the dataset is freely available. But it's not clear how to achieve that, i.e. where to get a large professional game collection without copyright protection.

> 2a) Size: My current view is that at least 2 sizes are necessary: small
> (1000-2000 games?) and large dataset (5-6 games).

What's the usecase for a small dataset?

--
Petr Baudis
If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: [Computer-go] Standard Computer Go Datasets - Proposal
I would only use it if it is licensed for commercial use.

David

On Fri, 13 Nov 2015 08:39:20 +, Josef Moudrik wrote:
> Hello List,
>
> There has been some debate in science about making the research more
> reproducible and open. Recently, I have been thinking about making a
> standard public fixed dataset of Go games, mainly to ease comparison of
> different methods, to make results more reproducible and maybe free the
> authors of the burden of composing a dataset. I think that the current
> practice can be improved a lot.
>
> Since the success of this endeavor crucially depends on how many authors
> use the dataset, I would like to ask You (potential authors) a few
> questions:
>
> 1) Would this be welcomed and used? Would You personally use it? (Am I not
> reinventing the wheel?)
>
> 2) What parameters should the dataset have? The number of dataset variants
> (if any) should be in my opinion kept at bare minimum to reduce
> "fragmentation".
>
> 2a) Size: My current view is that at least 2 sizes are necessary: small
> (1000-2000 games?) and large dataset (5-6 games).
>
> 2b) Strength & year span: Currently I am thinking about including modern
> professional games only (1970-2015)
>
> 3) Do you have any other comments, requirements for the dataset and ideas?
>
> Thanks for Your attention,
> Kind regards
> Josef Moudrik
[Computer-go] Standard Computer Go Datasets - Proposal
Hello List,

There has been some debate in science about making the research more reproducible and open. Recently, I have been thinking about making a standard public fixed dataset of Go games, mainly to ease comparison of different methods, to make results more reproducible and maybe free the authors of the burden of composing a dataset. I think that the current practice can be improved a lot.

Since the success of this endeavor crucially depends on how many authors use the dataset, I would like to ask You (potential authors) a few questions:

1) Would this be welcomed and used? Would You personally use it? (Am I not reinventing the wheel?)

2) What parameters should the dataset have? The number of dataset variants (if any) should be in my opinion kept at bare minimum to reduce "fragmentation".

2a) Size: My current view is that at least 2 sizes are necessary: small (1000-2000 games?) and large dataset (5-6 games).

2b) Strength & year span: Currently I am thinking about including modern professional games only (1970-2015)

3) Do you have any other comments, requirements for the dataset and ideas?

Thanks for Your attention,
Kind regards
Josef Moudrik