Andrej Karpathy was previously Director of AI at Tesla, a founding team member of OpenAI, and a CS PhD at Stanford. He is currently training deep neural nets on large datasets at karpathy.ai, has started Eureka Labs, a new AI+Education company, and provides AI tutorials on YouTube.
He gives insight into how he uses the various versions of ChatGPT. Here’s his tweet: https://x.com/karpathy/status/1929597620969951434
To briefly summarize:
He uses GPT-4o 40% of the time for “Any simple query (e.g. ‘what foods are high in fiber’?)”
He uses o3 for another 40%: “Any hard/important enough query where I am willing to wait a bit.” Andrej says, “(I)f you are using ChatGPT professionally and not using o3 you're ngmi.” That’s “Not Gonna Make It.”
Vibe coding with GPT-4.1 constitutes another 10% of his usage.
The last 10% is for Deep Research on the Tools menu, when “I want GPT to go off for 10 minutes, look at many, many links and summarize a topic for me.”
This is the graphic he provided in the tweet. Over 600 comments have critiqued his observations.
Here is some added discussion to help illuminate the differences among the versions:
GPT-4o: the fast generalist, for the widest variety of use cases. Casual Q&A, describing what’s happening in a photo, summarizing articles, and so on. It sounds smart and confident but can hallucinate, so don’t trust it with number crunching or critical code. Best for easy-to-medium tasks.
o3: advanced reasoning for when logic is important. If you’re on the $200-per-month plan, you also have o3-pro. You can watch its reasoning play out: multi-step reasoning with sources, logic problems and planning, coding help, structured arguments. It can search the internet for the best data available.
Deep Research: begins by asking you clarifying questions, then calls in faster models for simpler subtasks. Preferred for presentations, academic work, and in-depth analysis; use it when citations are needed.
GPT-4.5 (preview): the wordsmith. Its strength is tone: strong voice, emotion, “using vivid, sensory language,” as one prompt might put it. I would never use this for my own writing; that style is a strong tipoff the writing came from AI.
GPT-4.1: for vibe coding, as mentioned above. A particular strength is that when accessed via the API, its context window can be as large as one million tokens, which comes in handy for large documents or code bases (see the API sketch after this list). API usage is billed separately from ChatGPT subscriptions; in the ChatGPT interface, all models are limited to a 32,000-token context window.
GPT-4.1-mini is the junior version of 4.1, great for quick coding and analysis but not as reliable. It still supports the million-token context window through the API and costs less to use.
o4-mini offers a good balance of speed and reliable reasoning. (I was given 100 free credits when I used the API in my n8n project.) It’s a great pick if you want something smarter than 4o but faster and cheaper than o3.
o4-mini-high is the upgrade with more compute per token, and it can carry you past o3’s rate limits.
o3-pro is o3 with more compute per token. That extra reasoning costs more and is slower. It’s considered the ultimate arbiter for very important tasks.
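For readers curious about the API route mentioned under GPT-4.1, here is a minimal sketch using the OpenAI Python SDK. It assumes an API key in your environment and uses a hypothetical file name; the model string can be swapped for gpt-4.1-mini or o4-mini to trade quality for speed and cost.

```python
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical input: any large document or code dump you want summarized.
big_doc = Path("large_codebase_dump.txt").read_text()

response = client.chat.completions.create(
    model="gpt-4.1",  # or "gpt-4.1-mini" / "o4-mini" for cheaper, faster runs
    messages=[
        {"role": "system", "content": "You are a careful code reviewer."},
        {"role": "user", "content": f"Summarize the main modules and flag obvious bugs:\n\n{big_doc}"},
    ],
)

print(response.choices[0].message.content)
```

Remember that this goes through the API, so it is billed per token, separately from a ChatGPT subscription.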