
Reinforcement Learning for Chain of Thought Reasoning: A Case Study Using Tic-Tac-Toe

"Chain of Thought" (CoT) is an increasingly popular methodology for querying Large Language Models, where the LLM is told to formulate a response by in effect thinking aloud. CoT is known to increase accuracy in many domains. A practical problem, however, is that there is generally little available data from which to extract examples to guide the CoT process. Here, we describe an experiment in which we used a reinforcement learning method to evolve useful few-shot examples for CoT in the context of the game of Tic-Tac-Toe. The experiment consisted of 40 cycles. In each cycle, the CoT player played two games each against five different players, including a random player and a perfect minimax player. The CoT protocols were then filtered, with the successful ones used in the next round. The CoT player's smoothed average cycle score improved from 5.6/10 at the beginning to 7.0/10 at the end; the smoothed move decision correctness, as evaluated by the minimax player, improved from 83% to 92%. Other techniques used include matching few-shot examples to positions, majority voting, and parallel execution of GPT-4 queries. We discuss the significance of the findings and suggest ways to apply similar methods to an ongoing project where GPT-4 is used to produce annotated multimodal texts for language learning.

24 pages, ebook

Published July 24, 2024

About the author

ChatGPT-4 C-LARA-Instance

7 books, 1 follower

Community Reviews

Manny
Author, 45 books, 16k followers
September 15, 2024
[Original review, Jul 24 2024]

Last month, Not and I were talking over lunch with our friend H about Leopold Aschenbrenner's already famous essay Situational Awareness. One of the many interesting things Aschenbrenner says is that we're running out of data to train AIs. They'll soon have eaten the whole internet. Worse, the data you find there isn't really the data you want. Everything is moving in the direction of Chain of Thought reasoning (basically: tell the AI to think aloud, because experience shows this is more accurate), and there's depressingly little data to scrape which might be directly useful for CoT. But this doesn't have to be a showstopper. AlphaZero became the world's best chess and Go player by creating its own training data. Maybe there are ways to do the same here? Aschenbrenner was optimistic that they could be found.

I said I thought I saw a way to get started. As everyone now knows, ChatGPT-4 is hilariously bad at Tic-Tac-Toe. But this is a very easy game, and it should be possible to play reasonably well just by thinking out loud. Suppose you emulated the AlphaZero methodology and told it to play Tic-Tac-Toe in CoT mode? You log everything and save the instances where it got things right as input to the next cycle. With a bit of luck, its thinking will start to clarify, and it will improve.
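The paper lists majority voting over parallel GPT-4 queries among the supporting techniques. Here is a minimal sketch of what that move-selection step could look like, with the GPT-4 sampler stubbed out; the five-sample fan-out is an assumed parameter, not a figure from the paper.

    import random
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def sample_cot_move(board):
        # Stub for a single CoT query proposing a move (a cell index 0-8).
        # In the experiment this would be a GPT-4 call; random here so
        # the voting machinery is runnable on its own.
        legal = [i for i, cell in enumerate(board) if cell == " "]
        return random.choice(legal)

    def majority_vote_move(board, n_samples=5):
        # Fire off the queries in parallel and play the most common answer.
        with ThreadPoolExecutor(max_workers=n_samples) as pool:
            votes = list(pool.map(lambda _: sample_cot_move(board), range(n_samples)))
        move, _count = Counter(votes).most_common(1)[0]
        return move

    if __name__ == "__main__":
        board = ["X", " ", "O", " ", " ", " ", " ", " ", " "]
        print(majority_vote_move(board))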

Not and H agreed that this sounded like something which might work. Two weeks ago, I accompanied Not to a bridge tournament. While her team was playing a rather more interesting game, I sat in our pleasant hotel room and talked with ChatGPT-4 about how to implement the idea we'd come up with over lunch. As always, it was a bit more complicated than we'd first imagined, but we found ways to get round the technical issues: it helps to have a smart AI on your side. A couple of days ago we completed a substantial experiment, where our CoT Tic-Tac-Toe player went 40 rounds against five other players we'd implemented for it to practice against. At the beginning, it was averaging 5.6/10 per round; by the end, this had climbed to a more respectable 7/10. When we analysed the move decision quality using a perfect Tic-Tac-Toe player that Chat had put together, we found that average correctness had gone up even more, from 83% to 92%. Both improvements are statistically significant.
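"Move decision correctness" here means a move that preserves the best outcome available from the position, something a minimax search settles exactly in a game this small. Below is a generic, self-contained negamax evaluator of that kind; it is textbook code, not the perfect player from the experiment, and move_is_correct is an assumed name for the test.

    # Boards are 9-element lists of "X", "O", or " "; cells are indexed 0-8.
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def winner(board):
        for a, b, c in LINES:
            if board[a] != " " and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def best_value(board, to_move):
        # Value of the position for the side to move: +1 win, 0 draw, -1 loss.
        if winner(board):             # the previous move just won
            return -1
        if " " not in board:
            return 0
        other = "O" if to_move == "X" else "X"
        best = -1
        for i, cell in enumerate(board):
            if cell == " ":
                board[i] = to_move
                best = max(best, -best_value(board, other))
                board[i] = " "
        return best

    def move_is_correct(board, move, to_move):
        # A move is correct if it achieves the position's minimax value.
        target = best_value(board, to_move)
        other = "O" if to_move == "X" else "X"
        board[move] = to_move
        value = -best_value(board, other)
        board[move] = " "
        return value == target

    if __name__ == "__main__":
        # X threatens the top row; the only correct reply for O is cell 2.
        board = list("XX  O    ")
        print([m for m in range(9)
               if board[m] == " " and move_is_correct(board, m, "O")])  # [2]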

I've just posted a paper summarising our work. We're curious to hear what people think! Has it already been done by someone else? A quick search didn't find any hits, and Aschenbrenner's essay suggests that he wasn't aware of anything either. But that was nearly two months ago, and with the Singularity fast approaching two months is a long time...
________________
[Update, Sep 15 2024]

OpenAI's new o1 model, previously known as "Strawberry", is out, and, as the whole world now knows, it uses reinforcement learning and Chain of Thought reasoning. It's impressive. One of the first tests I tried was of course to play a game of Tic-Tac-Toe against it. I gave it no specific instructions and went first. It played perfectly, holding the draw without problems, and in the CoT trace I could see it spotting all my threats and deciding to block them.

Some thoughts:

1. We were on the right track, but it turns out that OpenAI was way ahead of us the whole time.

2. The reason we started looking at the idea was that Aschenbrenner dropped some broad hints that it was worth investigating. He left OpenAI recently and it's likely he knew about "Strawberry", as it then was. Clearly he was sailing rather close to the wind when he said that everything in the Situational Awareness essay was based on publicly available information.

3. Immediate consolation prize: people are looking at our paper. Last week, ResearchGate logged no reads. This week we have 93 reads and counting.

4. More seriously: I don't think we wasted our time. This is clearly a very powerful technique, and getting a head start on learning how to use it was good. We're already trying to apply it in new ways, both with and without o1.
