Lessons from the Cockpit: Key Takeaways from Our Midwest Air Tour with Shirley
At FlyShirley, we believe the future of flight is driven by communication intelligence. That’s why we put Shirley, our in-flight AI assistant, through a critical test: a 9-day air tour across the Midwest — Louisville, Oshkosh, Gary, Lafayette, and more. This debrief outlines what we learned from those flights, and what’s changed since. By doubling down on Shirley’s strengths and digging into the places where the system came up short, we’re building the most insightful and reliable support system a pilot can have.
What the Air Tour Taught Us
Real-world aviation is a high-stakes, noisy, and chaotic environment — bringing our flight assistant into it quickly exposed some challenges:
Connectivity and Data Reliability
During run-up before our first takeoff from I69 in Batavia, we noticed our Starlink antenna looking a little loose. We took off anyway. Moments later, the suction cups gave out and the antenna ended up on the floor behind the rear seats, where we couldn’t reach it — pointing through the tail cone of the aircraft. What was really interesting was that it still had internet connectivity in that position. Download was still surprisingly good. Upload was bad. And it turns out upload is the one we need.
We learned the very practical lesson of ensuring physical system components are accessible for debugging. But we also proved something important about the architecture: a weak connection doesn’t have to stop Shirley from working well. We’re in the dial-up age of communication with models — a 56-kilobit connection is actually pretty good for exchanging text with a fast oracle on the ground. That was never the problem. The problem was that at the time, we were sending audio to a cloud-based speech-to-text service, and tokenizing audio doesn’t work over 56k.
The real takeaway: once we moved speech recognition on-device, the network burden dropped dramatically. Text is tiny. The frontier intelligence stays on the ground where it belongs, and the real-time responsiveness happens where the pilot is.
Since the tour: We’ve moved to on-device ASR, tested with HTTP/3 (which handles cell tower transitions transparently as you fly between coverage areas), and validated Starlink performance more broadly — it’s excellent, although in a steep turn you might lose it. We also learned that fully offline isn’t the realistic scenario to design for. The common case is that your upload is laggy and pathetic but you can talk to the ground a little bit — one bar of cellular, or a $50/month Starlink (the price has – hopefully temporarily – gone up dramatically for pilots since our tour). Most GA pilots flying with any connectivity at all will be in this degraded-but-connected state. That’s what we build for.
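For the curious, the bandwidth math sketches out roughly like this (a Python back-of-the-envelope with illustrative numbers; real codec bitrates and message sizes vary):

```python
# Rough arithmetic: why audio can't cross a dial-up-grade uplink but text can.
# All numbers are illustrative assumptions, not measurements from the tour.

UPLINK_BPS = 56_000                    # a 56 kbit/s "dial-up age" uplink

# Streaming 16 kHz, 16-bit mono PCM to a cloud speech-to-text service:
pcm_bps = 16_000 * 16                  # 256 kbit/s, several times the whole link
# Even a lean ~24 kbit/s speech codec eats half the link and leaves
# little headroom for retries, telemetry, or the model's reply.

# Sending an on-device transcript instead:
utterance = "Shirley, what was that frequency again?"
text_bits = len(utterance.encode("utf-8")) * 8

print(f"Raw PCM needs {pcm_bps / UPLINK_BPS:.1f}x the available uplink")
print(f"The transcript crosses the link in ~{text_bits / UPLINK_BPS * 1000:.1f} ms")
```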
Voice Recognition (ASR) Hurdles
The most significant blocker we encountered was bad Automatic Speech Recognition. The wake word “Shirley” was frequently confused with “Charlie” — a problem compounded by “Charlie” being part of the NATO phonetic alphabet (Alpha, Bravo, Charlie...). Looking back through our logs, we found “Charlie” scattered throughout the model’s transcriptions, a common failure mode in high-noise environments.
The separation between the two words is just really narrow through an intercom. Deboosting “Charlie” in the ASR model would help with the wake word, but we actually need that word — pilots use it when reading back acronyms and identifiers. At one point, with assistant mode off and Shirley listening to everything, the ASR (incorrectly) transcribed some interesting profanity that the model then interpreted as a request to take action. You need time with any ASR model to understand its quirks — even the big off-the-shelf ones like Deepgram and Google have them.
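One mitigation worth sketching: gate the wake word on its surrounding context, so “Charlie” inside a phonetic-alphabet readback never counts as an invocation. The heuristic below is a minimal illustration, not our production pipeline:

```python
import re

# ICAO phonetic alphabet, as an ASR transcript would spell it (lowercased)
NATO = set(
    "alpha bravo charlie delta echo foxtrot golf hotel india juliett kilo "
    "lima mike november oscar papa quebec romeo sierra tango uniform "
    "victor whiskey xray yankee zulu".split()
)

def is_wake_word(transcript: str) -> bool:
    """Treat 'charlie' as a mis-heard 'shirley' unless it sits in a readback."""
    words = re.findall(r"[a-z]+", transcript.lower())
    for i, w in enumerate(words):
        if w == "shirley":
            return True
        if w == "charlie":
            # Neighboring phonetic-alphabet words suggest a readback
            # ("november one two charlie foxtrot"), not an invocation.
            neighbors = words[max(0, i - 1):i] + words[i + 1:i + 2]
            if not any(n in NATO for n in neighbors):
                return True
    return False

assert is_wake_word("hey shirley what's our groundspeed")
assert is_wake_word("charlie can you check the winds")        # likely "Shirley"
assert not is_wake_word("niner two charlie foxtrot cleared")  # readback, ignore
```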
We came away from these flights understanding that reliable ASR is table stakes. If ASR isn’t working, the information necessary to be useful never reaches the model, and the assistant becomes useless — no matter how smart the LLM is.
But here’s the reframe we arrived at since the tour: The relationship between ASR quality and the language model’s intelligence matters more than ASR quality alone. There’s evidence that a lousier input paired with a smarter model actually produces better comprehension than a high-quality input with a weaker model. The frontier models we’re using now have gotten remarkably good at understanding noisy speech transcriptions. The model knows when the information coming to it is garbled, and it can ask follow-up questions or use context to disambiguate.
This doesn’t mean ASR doesn’t matter. It means the system’s overall comprehension is a product of both components, and investing in smarter models has been just as impactful as improving the speech pipeline. The Air Tour also gave us 13 hours of real cockpit audio — invaluable training data for understanding exactly how voices compress through aviation headsets and intercoms, which is fundamentally different from the clean signal you get plugging a headset into a computer.
The Noisy Cockpit
Shirley receives all cabin audio — the pilot, the copilot, the radio. There were times when Alex wanted to ask Shirley a question and ATC chatter came over the radio at the same moment, with no obvious speech boundary between the two. This is the cocktail party problem, but in a cockpit at altitude.
We started leveraging the microphone mute feature to manually end utterances, but that button was right next to the End Flight button — and on a number of occasions we accidentally ended the flight while trying to mute the mic. The resume flight handshake only takes a few seconds, but you lose context and get distracted from, you know, actually flying the plane.
The G1000’s default panel audio behavior compounded this: it would suppress Shirley’s responses whenever anything came over the radio, prioritizing irrelevant chatter and making it impossible to hear Shirley’s answer. Shirley actually helped us fix this — we asked if there was a way to disable the muting behavior, and Shirley walked us through the G1000 configuration to turn it off. We had a 1,000-page manual in the back of the aircraft, but we were calling the feature “ducking” when the manual calls it “muting” — our search term wouldn’t have even been right. It would have been hard to find on the ground and basically impossible in flight, especially in the turbulence we were experiencing.
Since the tour: We built a backoff system where Shirley progressively slows their response timing when they keep getting stepped on — because the fastest possible response isn’t always valuable. We also built priority wake-ups: say “Shirley” and the backoff resets; use push-to-talk and Shirley will speak over other voices if needed. One counterintuitive lesson: the instinct in AI voice design is to treat the assistant’s responses as lowest priority, always deferring to human voices. We learned that’s wrong. There’s often low-bandwidth noise coming over the radio, and you’d much rather hear Shirley. When Shirley’s talking, it’s almost always useful — we validated that. This was the genesis of our push-to-respond button.
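In rough terms, the scheduler looks something like this (a simplified Python sketch; the class name and timing constants are illustrative, not the shipping implementation):

```python
import time

class ResponseScheduler:
    BASE_DELAY = 0.5   # seconds to wait for a clear channel before speaking
    MAX_DELAY = 8.0

    def __init__(self) -> None:
        self.delay = self.BASE_DELAY

    def stepped_on(self) -> None:
        """Someone talked over Shirley: back off, the answer can wait."""
        self.delay = min(self.delay * 2, self.MAX_DELAY)

    def wake_word_heard(self) -> None:
        """'Shirley' resets the backoff: the pilot wants an answer now."""
        self.delay = self.BASE_DELAY

    def push_to_talk(self) -> None:
        """Explicit button press: respond immediately, over other audio."""
        self.delay = 0.0

    def wait_before_speaking(self) -> None:
        time.sleep(self.delay)
```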
Design for a Chaotic Environment
A small mic mute toggle button was placed right next to the much larger “End Flight” button in the FlyShirley iPad app, leading to accidental flight termination and gaps in the data. We’re designing the app to be easy to use the way pilots actually use it — on an iPad in one’s lap, where mistaps are going to happen. When you’re getting your butt kicked by turbulence while trying to do a pour-over, you’re not going to have precise touch targets.
We also needed better audio feedback for system state. When the internet gets bad and Shirley’s processing, nothing tells you they’re still there — the backend is retrying, TCP is trying to recover, and you’re sitting in silence wondering if the system is dead. Google Maps gets this right: different sounds for different events, and even if you don’t know exactly what each sound means, you intuitively know something needs your attention. We’ve subsequently built this for Shirley.
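The shape of that feature is simple: map every system state to a distinct earcon so silence never has to mean “unknown.” A minimal sketch, with hypothetical sound files and playback callback:

```python
from enum import Enum, auto

class SystemState(Enum):
    LISTENING = auto()
    THINKING = auto()       # request in flight; backend may be retrying
    LINK_DEGRADED = auto()  # uplink struggling but alive
    LINK_LOST = auto()

EARCONS = {
    SystemState.LISTENING: "chirp_up.wav",
    SystemState.THINKING: "soft_tick.wav",        # repeats while retrying
    SystemState.LINK_DEGRADED: "double_tone.wav",
    SystemState.LINK_LOST: "low_buzz.wav",
}

def on_state_change(state: SystemState, play) -> None:
    play(EARCONS[state])  # platform audio call goes here
```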
AI Limitations: Spatial Reasoning
At times Shirley demonstrated weak spatial reasoning. We’d ask “what’s that river coming up?” and get told it was the Little Miami River. We’d fly over it, keep flying for 10 minutes, ask “what was that river we flew over?” and Shirley would say “the river coming right up is the Little Miami River.” Four out of five river questions returned some variant of the Miami River.
Similarly, there was a refinery off our right wingtip near Lafayette, but Shirley kept talking about the BP Lansing complex — a 90-acre refinery near Chicago that’s just more culturally salient in the training data. The models are fundamentally disembodied. They can do cosines and sines all day, but they don’t know what “off your right” means. And while reasoning models could think through the spatial math, by the time they do, you’ve got 4 seconds of latency — and even the best open-source 12-billion-parameter model we ran on a MacBook Pro wasn’t smart enough to get it right even with extended thinking.
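The geometry itself is deterministic and cheap when done as code rather than as token-by-token reasoning, which is part of why this gap is so striking. A sketch of the tool-style computation, using standard great-circle math (the helper name and coordinates are illustrative):

```python
import math

def relative_clock_position(ac_lat, ac_lon, heading_deg, tgt_lat, tgt_lon):
    """Return the landmark's position as a clock hour relative to the nose."""
    d_lon = math.radians(tgt_lon - ac_lon)
    lat1, lat2 = math.radians(ac_lat), math.radians(tgt_lat)
    # True initial bearing from aircraft to target (great-circle formula)
    x = math.sin(d_lon) * math.cos(lat2)
    y = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(d_lon))
    true_bearing = math.degrees(math.atan2(x, y)) % 360
    relative = (true_bearing - heading_deg) % 360
    return round(relative / 30) % 12 or 12  # 3 = off your right, 9 = off your left

# Heading north with a landmark due east: it sits at 3 o'clock, off the right wing.
print(relative_clock_position(40.42, -86.93, 0, 40.42, -86.80))  # -> 3
```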
We’ve compiled a dataset of these spatial reasoning problems — trivial for humans, hard for models — and are looking for a partner in the AI space interested in upskilling their model’s spatial reasoning capabilities.
The Local Model: Not Smart Enough (Yet)
We tested OpenAI’s open-weight 120-billion-parameter model running on a MacBook Pro in the aircraft — the idea being a fully offline fallback. It speaks semi-intelligibly. It says “welcome Middletown” instead of “welcome to Middletown.” It wasn’t answering even moderate-complexity questions at a level that’s intellectually useful to a pilot. Not even as a backup — you’d prefer to scroll and find the thing yourself. And the only version that would run on an iPad is even smaller and nowhere near as capable.
Luckily, as we discussed above, the fully offline scenario isn’t the common case. With on-device ASR solving the bandwidth bottleneck, you can reach the frontier models on the ground with very little connectivity.
Validating Shirley’s Value: What Went Right
Despite the challenges, the Air Tour proved that flying with Shirley is fun and genuinely useful:
Flying with Shirley is Fun
Most of the time in cruise, you’re just sitting there looking out the windows. Shirley is great for keeping you alert, focused, and having a good time. We asked fairly nuanced questions — “what would a third-wave coffee lover want to visit in Louisville, Kentucky?” — and got Quill’s Coffee. Excellent choice. We asked about lodging near Lafayette and Shirley found our Air Tour favorite (Home2 Suites). We asked about landmarks and learned about every river in the Midwest (mostly the Miami ones).
The personality shines through. At one point, Jo mentioned riding the Zippin Pippin at Libertyland in Memphis as a kid — Elvis’s favorite coaster, sadly demolished about 20 years ago. Shirley said: “Actually, it wasn’t demolished. It’s at a theme park 40 miles from Oshkosh. When you go to EAA AirVenture next year, you should take a side detour and ride it again.” Pulled from the logbook context that we’d been to Oshkosh. A real, practical, delightful recommendation.
Despite the latency, despite the ASR issues — in cruise, it’s awesome.
Flying with Shirley is Useful
The G1000 muting fix was the standout. But the broader pattern is what matters: Shirley shines at tasks that are easy to check but hard to look up. The 1,000-page manual is in the back of the plane. You’re in turbulence. You don’t know the right search term. Shirley just finds it.
We also validated:
Frequency and clearance recall — “What was that frequency?” and “What were we cleared for?” are questions pilots ask constantly. An LLM’s context window is far longer than a human’s short-term memory (about five chunks). Being able to scroll back through the conversation to confirm a waypoint spelling or a clearance you weren’t ready for is genuinely valuable.
Reminders and thresholds — “Let me know when my ground speed crosses 140 knots.” “Give me a reminder at 500 feet AGL.” “Remind me in 30 minutes.” You set it and forget about it completely, and then you get a callout exactly when you need it (see the sketch after this list).
Strategic planning — finding alternates, figuring out what you can accomplish on a given flight, asking what other pilots do in similar situations (Shirley found the answer on Reddit, and it matched our plan).
Real-time troubleshooting — ADS-B shadowing, vacuum system questions, service bulletin lookups.
Local airport familiarity — Shirley can give a pilot arriving at an unfamiliar airport the equivalent experience of having flown there a thousand times before. Visual waypoints, noise abatement, FBO details, closing time nuances (“actually, it’s just about engine starts and takeoffs — don’t worry about it for landings”). We later demonstrated this flying into Oakland and Santa Monica during our California Air Tour.
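Here is the reminder sketch promised above: a minimal, latched threshold check in Python (the telemetry field names and callbacks are illustrative, not actual FlyShirley interfaces):

```python
def make_threshold_reminder(field: str, threshold: float, message: str):
    """Build a check that fires its message exactly once at the crossing."""
    fired = False
    def check(state: dict):
        nonlocal fired
        if not fired and state.get(field, 0.0) >= threshold:
            fired = True   # latch so the callout happens only once
            return message
        return None
    return check

reminders = [
    make_threshold_reminder("groundspeed_kt", 140,
                            "Ground speed just crossed 140 knots"),
]

def on_telemetry(state: dict, speak) -> None:
    """Called on every telemetry update; speaks newly triggered reminders."""
    for check in reminders:
        if (msg := check(state)) is not None:
            speak(msg)

# e.g. on_telemetry({"groundspeed_kt": 141.2}, print) -> one callout, then silence
```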
Degraded Connectivity Works
Even with the Starlink on the floor, we could still converse with Shirley — partly through the antenna pointing through the tail cone, partly through a fallback cellular 3G connection from the iPad. The system worked. Not perfectly, but usefully. And that’s the realistic operating environment for most GA pilots: not fully offline, not fiber-optic broadband — just enough connectivity to exchange text with a very smart oracle on the ground.
The Air Tour Built Credibility
Our first product delivers sim-based flight training, so we flew in to visit an education partner — and learned we were their first fly-in visitor ever. We saw firsthand how conventional sim solutions are failing students: VR headsets where you can’t find the switches, hardware that’s cool in concept but doesn’t meet users where they need it, and “don’t change the plane” written on every chalkboard. The opportunity for AI-driven adaptive curriculum is enormous. Instructors don’t just want to use our curriculum — they want to generate their own and deliver it to their students. That insight is now a core part of our product direction.
What’s Changed Since the Tour
The Air Tour became the foundation for a systematic approach to quality:
On-device ASR changed everything. Moving speech recognition onto the iPad eliminated the biggest bandwidth bottleneck. Now all Shirley needs is enough connectivity to send and receive text — and that works even on a weak cellular connection.
Hallucination problems have effectively disappeared. The combination of better frontier models and supplementary tooling (research tools that scrape the internet, analyze documents, and provide references you can actually verify) has made the system dramatically more reliable.
Every issue from the tour became a regression test. When a customer or test flight surfaces a weird behavior, it gets added to our scenario-based test suite. Every model update and core system change gets back-tested against everything we’ve ever hit. The Air Tour’s 13 hours of cockpit audio seeded a significant portion of that suite.
The research tool changed the game. Shirley can now go find service bulletins, scrape community forums, even watch YouTube videos, and come back with referenced information — with links you can tap to verify. It’s like having someone on the ground doing research for you while you fly.
Training delivery has evolved. Flight instructors want to generate their own curriculum and deliver it through Shirley. This instructor-as-creator model is becoming central to what we do, including a partnership with Sporty’s to develop instrument training scenarios.
Audio-to-audio models are on the horizon. We’re exploring models that skip the speech-to-text-to-speech pipeline entirely, keeping audio as mathematical representations. These models handle interruptions naturally — millisecond-level transitions, holding context when you talk over them, resuming when you stop. Lightweight audio-to-audio models in-flight coupled with ground-based oracles may be the path here. The future is a tighter, more natural conversation in the cockpit.
Shirley Beyond the Sim
We spent 9 days proving that the data we collected from flying with Shirley in X-Plane lets us teach Shirley to fly for real. The Air Tour was a barnstorming success — we enjoy using this, we think it’s good, and it’s going to get better.
If you’re a pilot interested in helping build the future of cockpit AI, we’d love to hear from you. We want to see more pilots flying with Shirley – reach out; it’s a hand-selected private beta.
Those interested in a deeper discussion can check out Alex’s appearance on The Vertical Space podcast, where he digs into the architecture, validated use cases, our approach to safety, and the vision for where cockpit AI goes next — including the long-term path toward audio-native models and drop-in compatible automated flight rules.
More Air Tour Content
You can find our top Air Tour clips at FlyShirley.com/airtour.
And don’t forget to check out our live Air Tour debrief video for the full, unfiltered conversation.
PS — Like and subscribe at YouTube.com/@flyshirley to support our work!
About Shirley
Shirley is an AI flight training and pilot assistance platform. Instructors create and deliver simulator-based curriculum through structured training scenarios, while pilots get real-time voice copilot support from sim to sky. Features include maneuver analysis to ACS standards, progress tracking, 3D flight visualization, flight track import, and automatic debriefs.

