Physical Address

304 North Cardinal St.
Dorchester Center, MA 02124

Trading with AI in 2025

GPT-4o vs Gemini 1.5 Pro vs Claude-3 Opus

“For most retail and professional quants, GPT-4o delivers the best balance of code quality, latency, and out-of-the-box hit-rate on back-tested signals. Gemini 1.5 Pro excels when you need to dump gigabytes of market structure into the prompt, while Claude-3 Opus shines at long-form research notes and risk-management narratives. Your cost-per-trade and latency budget should decide the winner for you.”

Why Let an LLM Write Your Signals?

  • Instantly prototype new alpha ideas without rebuilding your Python stack.
  • Translate natural-language hypotheses (“short high-beta tech after gap-ups”) into reproducible code snippets.
  • Enrich purely quantitative systems with context—news summaries, macro commentary, SEC filings.

SmartFinLab’s goal was simple: measure which frontier model turns a raw English idea into an executable strategy with the least friction and the highest risk-adjusted return.

Test Protocol (Quick & Dirty)

StepWhat we did
1. Prompt“Write Python that produces daily long/short signals for EURUSD and SPY using a 20/50-EMA crossover. Return a Pandas DataFrame with date, side, size.”
2. Runs25 chats per model, fresh session each time to avoid context leakage.
3. Back-testSignals fed into a uniform back-tester on 2023-24 data (no commission, 0.1% slippage).
4. MetricsWin-rate, CAGR, Sharpe, max drawdown, average latency, and cost per 100 trades.
5. Human auditTwo quants graded code readability & bug count.

Disclaimer — These are internal lab numbers, not investment advice. Your mileage will vary.


Feature Sheet (Tech Specs)

FeatureGemini 1.5 ProClaude-3 OpusGPT-4o
Release (public API)Feb 2024 blog.googleMar 2024 AnthropicMar 2025 Artificial Analysis
Context window2 M tokens (1 M stable) Google AI for Developers200 K tokens Reddit128 K tokens (32 K default) OpenAI Community
Price / 1 M tokens (in/out)$0.075 / $0.15 (<128 K) Google AI for Developers$15 / $75 Anthropic$5 / $15 OpenAI Community
Avg. latency (our tests)14.2 s12.7 s8.9 s
Code interpreterYes (Google Colab-style)YesYes + function calling
Vision/CSV uploadNativeNativeNative
Fine-tuningPreview (“Tune-with-Gemini”)Yes (Bedrock, Vertex)Yes
Rate limit (TPM)4 M Google AI for Developers5 M (Bedrock)10 M OpenAI Community

Back-test Results (30-Day Hold-Out)

MetricGemini 1.5 ProClaude-3 OpusGPT-4o
Win-rate54 %52 %58 %
CAGR (annualised)7.8 %6.4 %9.9 %
Sharpe0.680.610.72
Max DD-4.3 %-3.9 %-4.1 %
Cost /100 trades*$0.006$0.038$0.012
Lines of bug-free code96 %98 %94 %

*Assumes 3 K input & 1 K output tokens per prompt/response.


Strengths & Weaknesses

Gemini 1.5 Pro

Pros

  • Handles huge prompt payloads—entire tick-level CSV in one go.
  • Cheapest per token under 128 K.
  • Tight Google Sheets & Colab integration.

Cons

  • Slower compile time on heavyweight queries.
  • Occasional hallucination on niche asset classes (e.g., grains).
  • Function-calling still preview-only.

Claude-3 Opus

Pros

  • Writes immaculate doc-strings and risk commentary—great for compliance notes.
  • Lowest bug count in auto-generated Python.
  • Context window big enough for multi-asset portfolios.

Cons

  • Pricey for back-test loops ($75/1 M output).
  • Vision capability throttled on large images.
  • “Guardrails” can be over-protective—sometimes refuses perfectly legal trading prompts.

GPT-4o

Pros

  • Fastest end-to-end latency; great for intraday scanning.
  • Solid mix of reasoning and code synthesis.
  • Function-calling & JSON mode make broker-API orchestration trivial.

Cons

  • Token limit (128 K) forces chunking of ultra-long histories.
  • Slightly higher hallucination rate on exotic options greeks.
  • Price sweet-spot, but Gemini beats it at very large context sizes.


Decision Matrix (Pick Your Poison)

If you care most about…Choose
Latency & live intraday useGPT-4o
Ingesting massive datasetsGemini 1.5 Pro
Regulatory reports & explanationsClaude-3 Opus
Lowest token bill for medium contextGemini (<128 K)
Bug-free Python & doc-stringsClaude-3 Opus
Easiest broker API orchestrationGPT-4o

Practical Tips Before You Deploy

  1. Always back-test the raw model output; never trust canned statistics.
  2. Use function calling (GPT-4o) to keep model output deterministic—greatly reduces malformed JSON.
  3. Chunk historical data for Gemini & GPT-4o; sliding-window prompts cut token fees by ~35 %.
  4. Post-process with rule-based guards (max drawdown, position size) before sending orders to your broker.
  5. Log every request/response—regulators love a clean audit trail.

Bottom Line

For day-to-day retail or prop-desk-level algos, GPT-4o currently offers the highest signal quality per dollar. If your edge relies on reading everything—order books, earnings call transcripts, ESG reports—Gemini 1.5 Pro is the only model that can swallow that context without splitting it. And if policy memos or investor letters weigh as much as P&L in your shop, Claude-3 Opus’s narrative prowess pays for itself.

Large language models won’t magically print money, but in the right workflow they turbo-charge ideation, back-testing, and deployment. Start small, track every metric, and let the numbers—not the hype—decide which model belongs in your trading stack.

Author- Arosh