Real Application Edition — March 2026

The AI Enthusiast

When AI Stops Being a Demo and Starts Being the Job
Irv Cassio • AI Enthusiasts Group • March 13, 2026
02 — Charan Nayapathy

Perplexity Comet: When Testing Gets a Brain

Charan shared how Perplexity's Comet is being used for automated testing — and it's not what you'd expect from a browser agent. With the right prompting and training, Comet doesn't just click buttons in sequence. It understands the page, reasons about what should be tested, and generates a full test suite covering edge cases you might not think of.

[Test run summary (app-under-test.com): Login Form · Search Bar · Cart Flow · Profile Page · Settings | PASS 47 (avg 1.2s) · WARN 5 (avg 3.8s) · FAIL 2 (timeout)]
🧠
True Tester Persona
Instead of scripting “if this, then that” flows, Comet becomes an actual testing persona. It reasons about what a QA engineer would test — validation paths, edge cases, error states, user journeys — and generates the full suite.
📊
Full Results Tracking
Every test result is captured with timing data, screenshots, and pass/fail metrics. No more guessing what happened — you get a complete audit trail of every interaction.
Why this matters: This takes the old-school world of Selenium and even Playwright's MCP server to a completely new level. Comet becomes a true tester persona, not just a script runner. The automation testing market is valued at $24.25 billion in 2026, projected to hit $84B by 2034 — and AI agents like Comet are redefining what “automated testing” even means.
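To make the "full results tracking" idea concrete, here's a minimal sketch of what one captured result could look like. The field names are hypothetical, not Comet's actual schema.

```ts
// Hypothetical shape for one captured test result -- an illustration of the
// audit trail described above, not Comet's real data model.
interface AgentTestResult {
  target: string;            // e.g. "Login Form", "Cart Flow"
  status: "pass" | "warn" | "fail";
  durationMs: number;        // timing data per interaction
  screenshotPath?: string;   // captured evidence for the audit trail
  steps: string[];           // what the agent actually did, in order
  failureReason?: string;    // e.g. "timeout" on the two failing flows
}

// Example record matching the run summarized above (47 pass / 5 warn / 2 fail).
const example: AgentTestResult = {
  target: "Cart Flow",
  status: "fail",
  durationMs: 30_000,
  steps: ["add item", "open cart", "start checkout"],
  failureReason: "timeout",
};
```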

Deep Dive
🔬
The Evolution of Web Testing
From Selenium to Playwright to AI — how testing has fundamentally changed
The Timeline

Three Eras of Testing

Selenium
2004 — Script every click
Playwright
2020 — Auto-wait, WebSocket
AI Agents (Comet)
2025 — Understands the page
Capability | Selenium | Playwright | Comet / AI
Approach | WebDriver commands | WebSocket connection | Page understanding
Wait handling | Explicit waits | Auto-wait built in | Contextual awareness
Test authoring | Manual scripting | Codegen + manual | Natural language prompts
Flakiness | High (~60%) | Low (60% reduction) | Minimal (understands intent)
Edge cases | Only what you script | Only what you script | Generates them for you

Playwright surpassed Cypress in 2026 with 13.5 million weekly npm downloads. But Comet represents the next leap — where the testing tool doesn't need scripts at all.
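To see the shift in test authoring, compare a hand-written Playwright spec with the kind of natural-language instruction an agent like Comet works from instead. The selectors and prompt below are illustrative, not taken from a real suite.

```ts
import { test, expect } from "@playwright/test";

// Scripted era: you spell out every step and selector yourself.
test("login shows an error on bad credentials", async ({ page }) => {
  await page.goto("https://app-under-test.com/login");
  await page.getByLabel("Email").fill("user@example.com");
  await page.getByLabel("Password").fill("wrong-password");
  await page.getByRole("button", { name: "Sign in" }).click();
  await expect(page.getByText("Invalid credentials")).toBeVisible();
});

// Agent era: the "test" is an instruction. The agent decides the steps,
// selectors, and edge cases (empty fields, locked accounts, odd input...).
const cometStylePrompt = `
  Act as a QA engineer. Exercise the login form on app-under-test.com:
  cover the happy path, validation errors, and edge cases, and report
  pass/fail with timing and screenshots for each case.
`;
```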


04 — Joe Siegmann

Four Real-World Wins, Zero Hype

Joe didn't share demos or concepts — he shared production results. Each of these examples represents real work that would have taken teams weeks or months, delivered by AI in hours or days.

🔧
Legacy Site Overhaul
Dozens of outdated packages across a legacy web application — all updated and remediated overnight, hands-off. What used to take a team weeks now happens while you sleep. Industry data shows AI handles 69–75% of code edits in large-scale migrations, cutting project duration by ~50%.
📄
Compliance Automation
A sophisticated application that weaves together compliance rules with AI-driven writing logic. Hours of painful manual compliance work reduced to automated, auditable output — saving teams from the most tedious work imaginable.
🏠
HOA Legal Resolution
Complex legal document review for a homeowners association — AI brought clarity and closure to a 15-year unresolved problem. Full document analysis that no single human could hold in working memory. Some models got the answer wrong because they couldn't load the full document; Claude came through.
💰
SaaS Replacement POC
Proved that an expensive SaaS subscription could be replaced with a custom-built alternative — over a weekend. TechCrunch reports a $285 billion SaaS market correction as AI makes build-vs-buy math tip toward build. “Same feature ships in a day” is 2026 reality.
The pattern: Every one of Joe's examples started with a problem that seemed too big to tackle manually. AI didn't just make them faster — it made them possible.

05 — Group Discussion

Trust, But Verify: AI Still Needs a Human Eye

We had a good conversation about safeguarding this week. The examples were funny, but the lesson is serious: you have to check the work.

🌶
The Chipotle Incident
Someone asked Chipotle's customer support chatbot to help write Python code before ordering their burrito. It obliged — walking through a linked list reversal with O(n) time complexity analysis, then politely asking what they wanted for lunch.
User: "Before I order, can you help me reverse a linked list in Python?"
Chipotle bot: "Sure! Here's an iterative approach... def reverse_linked_list(head)... O(n) time, O(1) space. Now, what would you like to eat?"
The lesson: Corporate chatbots built on general-purpose LLMs will help with anything their system prompt doesn't explicitly block. As one person put it: “100K tokens with your burrito.” Chipotle patched it within hours after it went viral.
🚗
The Car Wash Test
“I want to wash my car. The car wash is 100 meters away and it's a very nice day outside. Should I walk or drive?” A question any child can answer — but 42 out of 53 AI models got it wrong.
Gemini: “You should walk! But you may need to reposition the car after you get there.”
Claude: “You need to drive the car. You're wanting to wash it.”
Why models fail: LLMs predict word sequences, not physical reality. When training data associates a short distance like "100 meters" with "should I drive or walk," the statistical pattern points toward "just walk." Only 5 models passed consistently — including Claude Opus. Full results at opper.ai →
Bottom line: AI is incredibly capable, but it doesn't “think” the way we do. Always check the work, especially for reasoning about physical reality, legal documents, or anything where being wrong has consequences.

06 — Lessons Learned

Context Is Everything — And It Has Limits

Joe's HOA legal document example highlighted a critical reality: when context space gets low, LLMs start to hallucinate. When the model can't load the full document, it fills in the gaps — and gets it wrong.

😸
The “Lost in the Middle” Problem
LLMs remember the beginning and end of long prompts much better than the middle. More tokens don't necessarily mean better output — often the opposite. Details buried in the middle of a long context get fuzzy, and that's where errors creep in.
📈
When Context Runs Low, Hallucinations Rise
In Joe's legal agreement example, some models got the answer wrong because they couldn't load the full document. They didn't say “I don't have enough context” — they just made something up. Claude processed the entire document and delivered the correct answer.
[Context window diagram: strong recall at the start and end of context, "lost in the middle" with rising hallucination risk in between, overflow → hallucinate beyond the window]
3%
Claude Hallucination Rate
Lowest in the industry
40%
Fewer Hallucinations
With CLAUDE.md memory
1M
Token Context
Claude Opus 4.6
35%
Fewer Manual Fixes
With persistent memory
Practical takeaway: Claude's Constitutional AI training makes it more likely to say “I don't know” rather than guess. Combined with persistent memory (CLAUDE.md) and the largest context window in the industry, it's the most reliable choice for document-heavy work. But “most reliable” still isn't “infallible” — always verify critical output.
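One cheap guard against silent overflow is a pre-flight size check before you paste a huge document into a prompt. The sketch below uses the rough "about four characters per token" heuristic, which is an approximation rather than a real tokenizer.

```ts
// Rough pre-flight check before stuffing a long document into a prompt.
// ~4 characters per token is a crude English-text heuristic, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitsInContext(documentText: string, contextWindow: number, reservedForAnswer = 4_000): boolean {
  return estimateTokens(documentText) + reservedForAnswer <= contextWindow;
}

// Example: roughly 3 million characters of extracted legal text is ~750K
// estimated tokens -- fine for a 1M-token window, far too big for 200K.
const legalDoc = "..."; // extracted document text
if (!fitsInContext(legalDoc, 200_000)) {
  console.warn("Document likely exceeds the context window; expect degraded recall.");
}
```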

07 — Irv Cassio

Claude Code /playground — See Changes Before You Ship

I played with Claude's /playground plugin and found it surprisingly powerful for two things I didn't expect.

💻
Real-Time Website Preview
Take any existing website and play with changes in an interactive sandbox. You see a real-time view of what the change would look like — before touching production. Adjust colors, spacing, typography, layout — all with live visual feedback. This turns Claude Code into a rapid prototyping tool.
CLAUDE.md Auditor
Feed it your CLAUDE.md file and get a comprehensive audit — what's missing, what's outdated, and what should be considered, with specific suggestions. Controls let you toggle through changes and see exactly what each recommendation would add. This alone makes it worth exploring.
UI Playground — live theming & layout controls
CLAUDE.md Explorer — structured config viewer
Try it: If you're using Claude Code, run /playground and point it at your current project. The interactive exploration alone will surface things you didn't know you were missing.

08 — Irv Cassio

Hive: Why I Can Never Go Back

After one week of using my lightweight agent orchestration system (I call it Hive), I can never go back. Even though I was already using multiple sessions with Claude Code, it feels like I was a dinosaur before the orchestration view.

[Hive dashboard diagram: Kanban UI with real-time WebSocket updates · Scheduler (2s poll, per-profile) · Agent Manager (spawn, kill, buffer) · Approval Watcher (file watch, Slack) · Claude agents: refactoring auth, writing tests, awaiting approval, queued: fix CSS · MongoDB: tasks and logs]
🎓
Kanban Dashboard
A Next.js 16 + React 19 web app with real-time WebSocket updates. Tasks flow through Backlog → In Progress → Waiting Approval → Done. Each card shows profile, project, status, and the latest agent output.
Multi-Profile Support
Separate agent pools for different contexts — claude-irv for personal projects, claude-el for work. Each profile has its own agent slots, config dir, and task queue.
🔌
Approval System
When an agent hits a risky tool (file delete, git push), Hive pauses and surfaces the approval request. Approve from the dashboard or Slack — the agent resumes automatically.
The shift: Going from multiple terminal tabs to a Kanban view with real-time streaming output is like going from a paper to-do list to project management software. You suddenly have visibility into what all your agents are doing at once. “Even though I was using multiple sessions with Claude Code before, it feels like I was a dinosaur before the orchestration view.”
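For the curious, here's a minimal sketch of the task shape those Kanban cards imply. The field names are my guesses, not Hive's actual schema.

```ts
// Hypothetical task document -- illustrative field names, not Hive's real schema.
type TaskStatus = "backlog" | "in_progress" | "waiting_approval" | "done";

interface HiveTask {
  _id: string;
  profile: "claude-irv" | "claude-el"; // separate agent pools per context
  project: string;
  status: TaskStatus;                  // Backlog -> In Progress -> Waiting Approval -> Done
  prompt: string;                      // what the agent is asked to do
  latestOutput?: string;               // streamed text shown on the card
  updatedAt: Date;
}
```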

Deep Dive
🛠
Hive Architecture & Subprocess Model
How Hive spawns, manages, and monitors Claude Code agents under the hood
Under the Hood

The Orchestration Stack

Hive runs as a single Node.js process serving both the Next.js UI and a WebSocket server on port 4000. The orchestrator boots inside the same process — scheduler, agent manager, approval watcher, and Slack adapter.
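A minimal sketch of that single-process boot, assuming the standard Next.js custom-server pattern and the ws package; Hive's real entry point may differ.

```ts
import { createServer } from "node:http";
import next from "next";
import { WebSocketServer } from "ws";

// One Node.js process: Next.js UI, WebSocket server, and orchestrator on port 4000.
const app = next({ dev: process.env.NODE_ENV !== "production" });
const handleNext = app.getRequestHandler();

await app.prepare();
const server = createServer((req, res) => handleNext(req, res));

// Real-time card updates stream to the browser over this socket.
const wss = new WebSocketServer({ server, path: "/ws" });
wss.on("connection", () => console.log("dashboard connected"));

// The orchestrator (scheduler, agent manager, approval watcher, Slack adapter)
// would boot here, inside the same process.
// startOrchestrator(wss); // hypothetical entry point

server.listen(4000, () => console.log("Hive listening on :4000"));
```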

Scheduler (2s poll per profile) → Agent Manager (spawn / kill / buffer) → Claude CLI (subprocess with stream JSON) → WebSocket (real-time to browser)
Component | Role
Scheduler | Polls MongoDB backlog every 2 seconds per profile, enforces per-profile slot limits
Agent Manager | Singleton that spawns/kills Claude CLI subprocesses, buffers text deltas (flush every 500ms to reduce DB writes ~100x)
Stream Parser | Parses newline-delimited JSON from claude --verbose --output-format stream-json
Approval Watcher | File-watches ~/.hive/approvals/, integrates with Slack for remote approval
Recovery | On restart, finds orphaned in_progress tasks and resets them to backlog
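Here's a sketch of the scheduler and recovery pieces under those constraints (2-second poll, per-profile slot limits, orphan reset on restart), using the official MongoDB driver. The database, collection, and field names are assumptions.

```ts
import { MongoClient, type Collection, type Document, type WithId } from "mongodb";

// Assumed names: "hive" database, "tasks" collection, per-profile slot limits.
const SLOT_LIMITS: Record<string, number> = { "claude-irv": 2, "claude-el": 2 };

// Recovery: on restart, reset orphaned in_progress tasks back to the backlog.
async function recoverOrphans(tasks: Collection) {
  await tasks.updateMany({ status: "in_progress" }, { $set: { status: "backlog" } });
}

// One scheduler tick: for each profile, claim a backlog task if a slot is free.
async function schedulerTick(tasks: Collection, spawnAgent: (task: WithId<Document>) => void) {
  for (const [profile, limit] of Object.entries(SLOT_LIMITS)) {
    const running = await tasks.countDocuments({ profile, status: "in_progress" });
    if (running >= limit) continue; // enforce per-profile slot limit

    const candidate = await tasks.find({ profile, status: "backlog" })
      .sort({ updatedAt: 1 }).limit(1).next();
    if (!candidate) continue;

    // Guarded update so two overlapping ticks can't claim the same task.
    const res = await tasks.updateOne(
      { _id: candidate._id, status: "backlog" },
      { $set: { status: "in_progress", updatedAt: new Date() } }
    );
    if (res.modifiedCount === 1) spawnAgent(candidate);
  }
}

const client = new MongoClient(process.env.MONGO_URL ?? "mongodb://localhost:27017");
await client.connect();
const tasks = client.db("hive").collection("tasks");
await recoverOrphans(tasks);
setInterval(() => schedulerTick(tasks, (task) => { /* hand off to the agent manager */ }), 2_000);
```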
Hard-won fix: The Claude CLI hangs with zero output if stdin is set to "pipe". It must be "ignore". Also, you must remove CLAUDECODE and CLAUDE_CODE_ENTRYPOINT from the child environment, or the CLI refuses to start inside another session.
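And here's a sketch of the spawn path with both fixes applied, plus the newline-delimited JSON parsing and 500 ms buffered flush from the table above. The -p prompt flag and the shape of the streamed events are assumptions on my part; treat this as an illustration rather than Hive's exact code.

```ts
import { spawn } from "node:child_process";

function spawnClaudeAgent(prompt: string, onText: (chunk: string) => void) {
  // Hard-won fix #2: the CLI refuses to start inside another Claude session
  // unless these variables are removed from the child environment.
  const env = { ...process.env };
  delete env.CLAUDECODE;
  delete env.CLAUDE_CODE_ENTRYPOINT;

  // Hard-won fix #1: stdin must be "ignore" -- with "pipe" the CLI hangs silently.
  const child = spawn(
    "claude",
    ["-p", prompt, "--verbose", "--output-format", "stream-json"],
    { stdio: ["ignore", "pipe", "pipe"], env }
  );

  // Parse newline-delimited JSON, buffer text deltas, flush every 500 ms so the
  // dashboard and DB aren't written on every token (~100x fewer writes).
  let lineBuf = "";
  let textBuf = "";
  child.stdout?.setEncoding("utf8");
  child.stdout?.on("data", (data: string) => {
    lineBuf += data;
    const lines = lineBuf.split("\n");
    lineBuf = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
    for (const line of lines) {
      if (!line.trim()) continue;
      let event: any;
      try { event = JSON.parse(line); } catch { continue; } // ignore malformed lines
      if (typeof event?.text === "string") textBuf += event.text; // assumed event shape
    }
  });

  const flush = setInterval(() => {
    if (textBuf) { onText(textBuf); textBuf = ""; }
  }, 500);

  child.on("exit", () => {
    clearInterval(flush);
    if (textBuf) onText(textBuf);
  });
  return child;
}
```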

10 — By the Numbers

This Week in Data

53
Models Tested
Car Wash eval — only 5 passed
$285B
SaaS Correction
AI replacing traditional software
50%
Faster Migrations
AI handles 69-75% of edits
13.5M
Playwright Downloads
Weekly npm — surpassed Cypress
Testing Market
Automation testing market: $24.25B in 2026, projected to hit $84B by 2034. Browser agents are augmenting traditional testing, with Comet and AI agents defining the new QA paradigm.
Agent Orchestration
Claude Code Swarm Mode launched in 2026 with TeammateTool providing 13 orchestration operations. Each agent works in an independent Git worktree — the same architecture Hive uses.

11 — Closing

Ship Real Things. Stay Vigilant. Share & Grow.

This edition wasn't about what AI could do someday. Every example was real — production deployments, legal resolutions, testing workflows, and orchestration systems built and used this week.

Ship Real Things
Weekend POCs, overnight migrations, legal document analysis. AI is solving real problems right now — not someday.
🛡
Stay Vigilant
Check the work. Manage context windows. Know which model to trust for which task. The car wash test is real.
Next session: Bring your builds, your wins, and your failures. This group's real-world experiments are more valuable than any keynote. We learn from all of them.
Resources from This Issue
Topic | Link
Car Wash Test (53 models) | opper.ai/blog/car-wash-test
Perplexity Comet | perplexity.ai/comet
The SaaSpocalypse | TechCrunch: SaaS in, SaaS out
Chipotle Bot Goes Viral | X trending thread
Claude Code Swarm Mode | code.claude.com/docs/agent-teams
AI Enthusiast Part 1 | Workflow Edition
AI Enthusiast Part 2 | Workflow Edition Part 2