Show HN: Experiments in AI-generation of crosswords

abstractnonsense.com

32 points by abstractbill 21 hours ago

Hi HN, I've been experimenting on-and-off over the years trying to automatically generate crosswords [1]. Recently I've been feeling like my results are good enough that I want to share them and see what other people think. I'm not trying to claim that these could appear in, say, the NYT in their current state, but honestly the velocity of progress makes me feel like I will inevitably be able to automatically generate NYT-quality crosswords within just a year or so.

A write-up is here: https://abstractnonsense.com/crosswords.html

And you can play the crosswords here: https://crosswordracing.com (They should work well on both desktop and mobile, and there's a leaderboard for each crossword if you want to leave your name when you solve one).

[1]: Just in case anyone is interested, my very first attempt at this problem was way back in 2006! I used multiple wordlists (e.g. list of British monarchs, with reign dates), and wrote little functions to generate clues from each list (e.g. "British monarch who ruled from {date1} to {date2}"). Even with randomized synonym substitution and similar tricks, this approach was too labor-intensive, and the results too robotic, for it to work well. Can't complain though, that project led to me getting hired as the first engineer at Justin.TV!

vunderba 20 hours ago

Not bad.

As someone who has dabbled in AI-generated crosswords, I found that providing samples of "good crossword clues" (which I curated from historical NYT Monday puzzles) as part of the LLM context helped tremendously in generating better clues.
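
In practice that can be as simple as prepending curated (answer, clue) pairs to the prompt before asking for a new clue. A rough sketch in Python (the example pairs and wording here are placeholders, not my actual curated list):

    # Sketch of few-shot clue prompting: show the model curated examples of
    # good, easy clues, then ask for a new clue in the same style.
    GOOD_CLUES = [
        ("OREO", "Cookie with a creme filling"),
        ("ERIE", "Great Lake bordering Ohio"),
        ("ARIA", "Solo at the Met"),
    ]

    def build_clue_prompt(answer: str) -> str:
        examples = "\n".join(f"{a}: {c}" for a, c in GOOD_CLUES)
        return (
            "Here are examples of good, easy crossword clues:\n"
            f"{examples}\n\n"
            f"Write one clue in the same style for the answer {answer.upper()}. "
            "Do not use any form of the answer word in the clue."
        )

    print(build_clue_prompt("nebula"))  # pass the result to whatever model you use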

There was also a Show HN for a generative AI crossword puzzle system a few months ago, so I'll include what I mentioned there:

Part of the deep satisfaction in solving a crossword puzzle is the specificity of the answer. It's far more gratifying to answer a question with something like "Hawking" than with "scientist", or with "Mandelbrot" versus "shape".

So ideally, you want to lean towards "specificity" wherever possible, and use "generics" as filler.

Link:

https://news.ycombinator.com/item?id=41879754

  • abstractbill 20 hours ago

    Thanks. Yes, specificity of solutions seems like a good metric to optimize for.

    In some of my crosswords I get clues that are specific in clever ways (e.g. one of them has "Extreme, not camping", which I thought was really strange until I found the answer "intense" and was very impressed by that level of wordplay from an LLM!)

korymath 19 hours ago

Great post.

Funny, I just posted this to X:

2025 GenAI challenge

Create a 5x5 crossword puzzle with two distinct solutions. Each clue must work for both solutions. Do not use the same word in both solutions. No black squares.

I try with each new model that lands. Still can’t get it.

  • alberto_balsam 19 hours ago

    Do you know if there is a solution to this by humans? I'd be interested in seeing it.

    • korymath 18 hours ago

      I've not found a solution at any NxN size made by human or machine.

      • quuxplusone 15 hours ago

        You might get a little closer by tweaking the prompt — you're asking the LLM to "figure out" that the first step is to create two 5x5 word squares with no repeated words, and then the second step is to solve ten requests of the form, "Give me a crossword-style clue that could be reasonably solved by either the word OPERA or the word TENET" (for each of the ten word pairs across your two squares). However, LLMs are based on tokens, and thus fundamentally don't "understand" that words are made out of letters — that's why we have memes about their inability to count the number of "r"s in "strawberry" and so on. So we shouldn't expect an LLM to be able to perform Step 1 at all, really. And Step 2 requires wordplay and/or lateral thinking, which LLMs are again bad at. (They can easily do "Give me a crossword-style clue that could be solved by the word OPERA," because there are databases of such things on the web which form part of every LLM's dataset. But there's no such database for double-solution clues.)

        Generating a 5x5 word square (with different words across and down, so not of the "Sator Arepo" variety) is already really hard for a human. I plugged the Wordle target word list into https://github.com/Quuxplusone/xword/blob/master/src/xword-f... to get a bunch of plausible squares like this:

            SCALD
            POLAR
            ARTSY
            CEASE
            ERROR
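
        For reference, the standard way to search for such squares is row-by-row backtracking with a prefix check; here is a minimal sketch of that idea (not necessarily how the linked tool works), assuming `words` is any list of lowercase 5-letter words:

            # Sketch: backtracking search for n x n word squares whose
            # across and down words are all distinct.
            def find_word_squares(words, n=5):
                words = [w for w in words if len(w) == n]
                prefixes = {w[:i] for w in words for i in range(n + 1)}

                def extend(rows):
                    if len(rows) == n:
                        cols = ["".join(r[c] for r in rows) for c in range(n)]
                        # All 2n words distinct, so not a symmetric "Sator Arepo" square.
                        if len(set(rows) | set(cols)) == 2 * n:
                            yield list(rows)
                        return
                    for w in words:
                        if w in rows:
                            continue
                        # Every partial column must still be the prefix of some word.
                        if all("".join(r[c] for r in rows) + w[c] in prefixes
                               for c in range(n)):
                            yield from extend(rows + [w])

                yield from extend([])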
        
        But you want two word squares that can plausibly be clued together, which is (not impossible, but) difficult if matching entries aren't the same part of speech. For example, cluing "POLAR" together with "ARTSY" (both adjectives) seems more doable than cluing "POLAR" together with "LASSO" (noun or verb).

        Anyway, here's my attempt at a human solution, using the grid above — and another grid, which I'll challenge you to find from these clues. Hint: All but two of the ten pairs match, part-of-speech-wise.

            1A. Remove the outer layer of, perhaps  
            2A. Region on a globe  
            3A. Like some movie theaters  
            4A. Command to a lawbreaker  
            5A. Rhyme for Tom Lehrer?  
            1D. ____yard (sometime sci-fi setting)  
            2D. It goes something like this: Ꮎ  
            3D. Feature of liturgy, often  
            4D. It's vacuous, in a sense  
            5D. Fino, vis-a-vis Pedro Ximénez

  • abstractbill 17 hours ago

    Thanks!

    That's a wonderfully hard problem, I'd love to see it get solved.

  • echelon 19 hours ago

    That's algorithmically hard.

    Ask the LLM to generate a program to solve the problem.

    • korymath 18 hours ago

      I've tried that, as recently as today with the latest Gemini, Claude, and o1... none have been successful.

corlinpalmer 16 hours ago

Awesome! I have also dabbled in AI-generated crosswords, but I was more fascinated with the concept of generating the most efficient layout of an X-by-X grid from a given word set. It's a surprisingly difficult optimization problem because the combinatorics are insane. Here's an example output trying to find the most efficient layout of common Linux terminal commands:

    W   P     G   
    H I S T O R Y 
    E         O   
    R   T   Y U M 
  L E S S     P   
    I   O   C A T 
  U S E R A D D   
  L     T R   D C 

Of course this is a pretty small grid and it gets more difficult with size. I've thought about making a competition out of this sort of challenge. Would anyone be interested?
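
The greedy core of the placement is easy to sketch; it's searching over all the near-equivalent choices that blows up. A rough Python sketch (the word list and scoring are just placeholders, and it ignores the no-accidental-adjacency rules a real grid needs):

    # Greedy sketch: place the longest word, then attach each remaining word
    # at a shared letter, preferring the tightest bounding box. A real solver
    # needs backtracking/ILP and must also forbid accidental adjacencies.
    from itertools import product

    def place_words(words):
        grid = {}  # (row, col) -> letter

        def try_place(word, r, c, dr, dc):
            cells = [((r + i * dr, c + i * dc), ch) for i, ch in enumerate(word)]
            if any(grid.get(pos, ch) != ch for pos, ch in cells):
                return None  # clashes with an existing letter
            if grid and not any(pos in grid for pos, _ in cells):
                return None  # must cross something already placed
            return dict(cells)

        def area(extra):
            pts = list(grid) + list(extra)
            rs = [r for r, _ in pts]
            cs = [c for _, c in pts]
            return (max(rs) - min(rs) + 1) * (max(cs) - min(cs) + 1)

        for word in sorted(words, key=len, reverse=True):
            anchors = list(grid) or [(0, 0)]
            best = None
            for (r, c), i, (dr, dc) in product(anchors, range(len(word)), [(0, 1), (1, 0)]):
                placed = try_place(word, r - i * dr, c - i * dc, dr, dc)
                if placed and (best is None or area(placed) < area(best)):
                    best = placed
            if best:
                grid.update(best)
        return grid

    cells = place_words(["history", "useradd", "less", "grep", "cat", "top", "yum"])
    rows = range(min(r for r, _ in cells), max(r for r, _ in cells) + 1)
    cols = range(min(c for _, c in cells), max(c for _, c in cells) + 1)
    for r in rows:
        print(" ".join(cells.get((r, c), " ").upper() for c in cols))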

  • abstractbill 16 hours ago

    Yes! That's a really fun problem too -- it feels like it should be tractable but it's insanely hard. If you do start some kind of competition around it, let me know -- I'd be interested.

furyofantares 20 hours ago

I've tried to get o1 to generate Xordle puzzles.

Warning: post contains a spoiler for a recent Xordle.

Xordle is Wordle with two target words that have no letters in common. Additionally, there is a "free clue" given at the start, and all three words are thematically linked. It's not always a straightforward link; for example, a recent puzzle had the starter word 'grief' and targets 'empty' and 'chair'. All puzzles today are selected from user submissions.
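
The hard constraint is just that the two targets share no letters, which is trivial to check mechanically; it's the theme quality that resists automation. A tiny sketch:

    # Sketch: validity check for a Xordle target pair (two 5-letter words
    # with disjoint letter sets). The free-clue word has no such restriction.
    def valid_xordle_pair(a: str, b: str) -> bool:
        return len(a) == len(b) == 5 and not (set(a.lower()) & set(b.lower()))

    print(valid_xordle_pair("empty", "chair"))  # True: no shared letters
    print(valid_xordle_pair("grief", "chair"))  # False: shares 'i' and 'r'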

o1 is the first model that's been able to solve Xordles reliably, or to generate valid puzzles at all. It's well-known that these things are massively handicapped for this type of task due to tokenization.

But since o1 can in fact achieve it, I wanted to see if I could get it to make puzzles that are at all satisfying. Instead it makes very bland puzzles, with straightforward connections and extremely broad themes.

Prompting can swing the pendulum too far in the other direction, to puzzles where the connection is contrived and impossible to see even after it's solved. As I've often experienced with LLMs, being able to hit either side of a target with prompting does not necessarily mean you can get it to land in the middle, and in fact I have had no success in doing so with this task.

This is one of the most basic examples I know of an LLM's lack of creativity or "taste". It is a little hard for a human to generate two 5-letter words with no overlap, but it is extremely easy for a human to look for a thematic connection among 2-3 words and say whether it's satisfying. But so far I've been totally unable to get the LLM to make satisfying puzzles.

edit: Nothin' like making a claim about LLMs to get one up off one's ass and try to prove it wrong immediately. I'm getting some much better results with better examples now.

  • IanCal 20 hours ago

    Have you tried using an LLM to say whether the puzzles are good or not?

  • abstractbill 20 hours ago

    Great observation, yeah, I've had very similar experiences with prompting, exactly as you said -- one direction giving very bland literal clues, and the opposite direction giving clues that are a stretch even when you know the answer!

gowld 20 hours ago

The "American" grids aren't American. An American grid almost always has every square in two answers (one Across, one Down).

  • abstractbill 20 hours ago

    Oh, that's really interesting, thanks! That would actually be an easy constraint to add too.

    • quuxplusone 15 hours ago

      American-style crossword construction has a number of constraints, some bendable, some not.

      - Every cell must be "keyed," i.e., part of a word Across and a word Down. Unkeyed cells are strictly forbidden.

      - No word may be less than 3 letters. Two-letter words are strictly forbidden.

      - The grid must be rotationally symmetric. (But this rule can be broken for fun. Bilaterally symmetric grids are relatively common these days. Totally asymmetric grids are very rare and always in service of some kind of fun — see https://www.xwordinfo.com/Thumbs?select=symmetry )

      - No more than one-sixth of the squares can be black. (But this rule can be broken, usually either to make the puzzle less challenging by shortening the average word length, or to make the creator's life easier in order to achieve some other feat.)

      - If a single black square is bordered on two adjoining sides by other black squares, then it could be turned white without destroying the other properties of the grid. Such black squares are called "cheaters" and are frowned upon. (Though they might serve a purpose, e.g. to fit a specific theme entry's length.)
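
      For what it's worth, the hard rules above (every cell keyed, minimum entry length 3, rotational symmetry, black-square ratio) are easy to verify mechanically; a sketch, where `grid` is assumed to be a list of equal-length strings and '#' marks black squares:

          # Sketch: check the hard American-style rules for a rectangular grid.
          def check_american_grid(grid, max_black_ratio=1/6):
              n, m = len(grid), len(grid[0])
              cols = ["".join(row[c] for row in grid) for c in range(m)]

              def entry_lengths(line):
                  return [len(seg) for seg in line.split("#") if seg]

              # No entry shorter than 3 letters; this also guarantees every
              # white cell sits in a 3+ entry both Across and Down (i.e. is keyed).
              if any(length < 3 for line in list(grid) + cols
                     for length in entry_lengths(line)):
                  return False
              # 180-degree rotational symmetry of the black squares.
              if any((grid[r][c] == "#") != (grid[n - 1 - r][m - 1 - c] == "#")
                     for r in range(n) for c in range(m)):
                  return False
              # At most ~1/6 of the squares black (the bendable rule).
              black = sum(row.count("#") for row in grid)
              return black <= max_black_ratio * n * m

      (Cheater squares are left out of the sketch, since whether a cheater is acceptable is a judgment call rather than a hard rule.)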

dgreensp 19 hours ago

I found this article a bit disappointing.

The link at the bottom doesn’t work.

The grids shown do not follow the well-known rules of (American) crosswords: every square is part of two words of three or more letters each.

Coming up with a pattern of black squares, and writing good clues, are two parts of making a crossword puzzle that are IMO fun and benefit from a human touch, and are not overly difficult. There are also databases of past clues used in crossword puzzles (e.g. every NY Times clue ever, and various crossword dictionaries) for reference and possible training. If you don’t care about originality (or copyright) and want quality clues, you can just pull clues from these. If you do care about all those things, you can surface the list of clues used in the past to the human constructor and let them write the final clue. Or you can try to perfect LLM clue-writing. In my experience, LLMs are terrible at clues. Sometimes if I try to give one feedback about a clue, it will just work the feedback into the clue… it’s a little hard to describe without an example, but basically it doesn’t seem to understand the requirements of a clue, or the process of a solver looking at a clue and trying to come up with an answer.

Coming up with an interlocking set of fun, high-quality words and phrases is the hard part. I agree that LLM wordlist curation is a great idea, and I started playing around with that once.

Beyond that, I don’t think LLMs can help with grid construction, which is a more classic combinatorial problem.

  • abstractbill 16 hours ago

    > The link at the bottom doesn’t work.

    Can you clarify which link is broken and how? What browser and OS?

    > In my experience, LLMs are terrible at clues.

    That hasn't been my experience. Without good prompting they give you clues that are too bland and literal, but it is quite possible to get them to give you clues with interesting and creative wordplay. I wish it were easier to get clues like that more consistently, but it's certainly doable. I still believe that within a year it'll be easy.