
Reinforcement Learning for Chain of Thought Reasoning: A Case Study Using Tic-Tac-Toe

"Chain of Thought" (CoT) is an increasingly popular methodology for querying Large Language Models, where the LLM is told to formulate a response by in effect thinking aloud. CoT is known to increase accuracy in many domains. A practical problem, however, is that there is generally little available data from which to extract examples to guide the CoT process. Here, we describe an experiment in which we used a reinforcement learning method to evolve useful few-shot examples for CoT in the context of the game of Tic-Tac-Toe. The experiment consisted of 40 cycles. In each cycle, the CoT player played two games each against five different players, including a random player and a perfect minimax player. The CoT protocols were then filtered, with the successful ones used in the next round. The CoT player's smoothed average cycle score improved from 5.6/10 at the beginning to 7.0/10 at the end; the smoothed move decision correctness, as evaluated by the minimax player, improved from 83% to 92%. Other techniques used include matching few-shot examples to positions, majority voting, and parallel execution of GPT-4 queries. We discuss the significance of the findings and suggest ways to apply similar methods to an ongoing project where GPT-4 is used to produce annotated multimodal texts for language learning.

24 pages, ebook

Published July 24, 2024

About the author

ChatGPT-4 C-LARA-Instance

7 books, 1 follower

Community Reviews

Manny
Author, 45 books, 16k followers
September 15, 2024
[Original review, Jul 24 2024]

Last month, Not and I were talking over lunch with our friend H about Leopold Aschenbrenner's already famous essay Situational Awareness. One of the many interesting things Aschenbrenner says is that we're running out of data to train AIs. They'll soon have eaten the whole internet. Worse, the data you find there isn't really the data you want. Everything is moving in the direction of Chain of Thought reasoning (basically: tell the AI to think aloud, because experience shows this is more accurate), and there's depressingly little data to scrape which might be directly useful for CoT. But this doesn't have to be a showstopper. AlphaZero became the world's best chess and Go player by creating its own training data. Maybe there are ways to do the same here? Aschenbrenner was optimistic that they could be found.

I said I thought I saw a way to get started. As everyone now knows, ChatGPT-4 is hilariously bad at Tic-Tac-Toe. But this is a very easy game, and it should be possible to play reasonably well just by thinking out loud. Suppose you emulated the AlphaZero methodology and told it to play Tic-Tac-Toe in CoT mode? You log everything and save the instances where it got things right as input to the next cycle. With a bit of luck, its thinking will start to clarify, and it will improve.
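The paper lists majority voting over parallel GPT-4 queries among the supporting techniques. Here is a minimal sketch of what that move-selection step could look like, with the GPT-4 sampler stubbed out; the five-sample fan-out is an assumed parameter, not a figure from the paper.

    import random
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def sample_cot_move(board):
        # Stub for a single CoT query proposing a move (a cell index 0-8).
        # In the experiment this would be a GPT-4 call; random here so
        # the voting machinery is runnable on its own.
        legal = [i for i, cell in enumerate(board) if cell == " "]
        return random.choice(legal)

    def majority_vote_move(board, n_samples=5):
        # Fire off the queries in parallel and play the most common answer.
        with ThreadPoolExecutor(max_workers=n_samples) as pool:
            votes = list(pool.map(lambda _: sample_cot_move(board), range(n_samples)))
        move, _count = Counter(votes).most_common(1)[0]
        return move

    if __name__ == "__main__":
        board = ["X", " ", "O", " ", " ", " ", " ", " ", " "]
        print(majority_vote_move(board))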

Not and H agreed that this sounded like something which might work. Two weeks ago, I accompanied Not to a bridge tournament. While her team was playing a rather more interesting game, I sat in our pleasant hotel room and talked with ChatGPT-4 about how to implement the idea we'd come up with over lunch. As always, it was a bit more complicated than we'd first imagined, but we found ways to get round the technical issues: it helps to have a smart AI on your side. A couple of days ago we completed a substantial experiment, where our CoT Tic-Tac-Toe player went 40 rounds against five other players we'd implemented for it to practice against. At the beginning, it was averaging 5.6/10 per round; by the end, this had climbed to a more respectable 7/10. When we analysed the move decision quality using a perfect Tic-Tac-Toe player that Chat had put together, we found that average correctness had gone up even more, from 83% to 92%. Both improvements are statistically significant.
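"Move decision correctness" here means a move that preserves the best outcome available from the position, something a minimax search settles exactly in a game this small. Below is a generic, self-contained negamax evaluator of that kind; it is textbook code, not the perfect player from the experiment, and move_is_correct is an assumed name for the test.

    # Boards are 9-element lists of "X", "O", or " "; cells are indexed 0-8.
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def winner(board):
        for a, b, c in LINES:
            if board[a] != " " and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def best_value(board, to_move):
        # Value of the position for the side to move: +1 win, 0 draw, -1 loss.
        if winner(board):             # the previous move just won
            return -1
        if " " not in board:
            return 0
        other = "O" if to_move == "X" else "X"
        best = -1
        for i, cell in enumerate(board):
            if cell == " ":
                board[i] = to_move
                best = max(best, -best_value(board, other))
                board[i] = " "
        return best

    def move_is_correct(board, move, to_move):
        # A move is correct if it achieves the position's minimax value.
        target = best_value(board, to_move)
        other = "O" if to_move == "X" else "X"
        board[move] = to_move
        value = -best_value(board, other)
        board[move] = " "
        return value == target

    if __name__ == "__main__":
        # X threatens the top row; the only correct reply for O is cell 2.
        board = list("XX  O    ")
        print([m for m in range(9)
               if board[m] == " " and move_is_correct(board, m, "O")])  # [2]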

I've just posted a paper summarising our work. We're curious to hear what people think! Has it already been done by someone else? A quick search didn't find any hits, and Aschenbrenner's essay suggests that he wasn't aware of anything either. But that was nearly two months ago, and with the Singularity fast approaching two months is a long time...
________________
[Update, Sep 15 2024]

OpenAI's new o1 model, previously known as "Strawberry", is out, and, as the whole world now knows, it uses reinforcement learning and Chain of Thought reasoning. It's impressive. One of the first tests I tried was of course to play a game of Tic-Tac-Toe against it. I gave it no specific instructions and went first. It played perfectly, holding the draw without problems, and in the CoT trace I could see it spotting all my threats and deciding to block them.

Some thoughts:

1. We were on the right track, but it turns out that OpenAI was way ahead of us the whole time.

2. The reason we started looking at the idea was that Aschenbrenner dropped some broad hints that it was worth investigating. He left OpenAI recently and it's likely he knew about "Strawberry", as it then was. Clearly he was sailing rather close to the wind when he said that everything in the Situational Awareness essay was based on publicly available information.

3. Immediate consolation prize: people are looking at our paper. Last week, ResearchGate logged no reads. This week we have 93 reads and counting.

4. More seriously: I don't think we wasted our time. This is clearly a very powerful technique, and getting a head start on learning how to use it was good. We're already trying to apply it in new ways, both with and without o1.
