another chatgpt moment

march 29, 2025

4 days ago, openai released gpt-4o native image output, and it has completely blown my mind ever since. i get pretty excited about every new ai model drop, such as when i waited for claude 3.7 sonnet, gemini 2.5 pro, gpt-4o, o3-mini, grok-3, and the list goes on. but the multimodal gpt-4o drop is the most excited i've been in a long time. with a regular model drop, i usually just feed some code into it, or plug it into cursor, and watch it zero-shot things that no other model can. then i start using it as my default coding model, and the cycle repeats.

llms became fun again

the gpt-4o multimodal output drop reminded me of when chatgpt was first released, when not many people were coding or doing anything technical with it; people were just having fun, asking it questions and making it write stories. once i got access to the model, i felt a kind of fun i hadn't had with ai in a while. i watched it redraw lots of my old photos in a rough pencil-sketch style, and had it create a really good-looking ad from a random phone photo of my starbucks. i made it draw comics about itself, and fed it a lot of my journals so it could make comics about me. it's been a very long time since i felt this, but: ai was finally fun again.

lightbulb abstract art

art generated by gpt-4o image output

omnimodal output models are just getting started...

after the new series of reasoning models, and now these omnimodal output models (which, to be fair, openai had internally for around a year), this class of models is just getting started. remember, gpt-4o is nowhere near a sota model; many people even consider it legacy in a way, and it can't do any reasoning. when we truly apply multimodal output to a model at the scale of gpt-4.5, we'll see another leap in capabilities that we probably can't even imagine right now.

we'll see even more interesting results when we apply multimodal output to reasoning. i can't quite picture what it will look like, but reasoning models thinking in the image modality alongside text would be extremely interesting and promising. personally, i couldn't be more excited to see what inference-time scaling will look like with multimodal output.

our future is bright. i believe there will soon be a day when ai scales beyond our wildest dreams.