What If...? An Immersive Story

Lead gameplay engineer on an interactive VR/AR episode of What If...?. Implemented a cross-platform gesture recognition system.

February 2023 - September 2024

ILM Immersive

About The Project:

Technology & Mediums:

Marvel’s What If...? An Immersive Story is an Emmy-winning, roughly 45-minute immersive experience designed for the Apple Vision Pro. As the lead gameplay engineer on the project, my responsibilities spanned every stage of development, from prototyping through to final implementation.

To power our gameplay interactions, I implemented a system that rapidly recognizes multi-step gestures and triggers the corresponding effects. This required developing a custom plugin to bridge native Objective-C ARKit data (hand and scene tracking) to our headless C++ Unreal Engine backend. Our gameplay systems used this plugin to control powers in the game and pipe the resulting visuals through our RealityKit-based rendering system. To streamline development across the studio, I also ensured the plugin supported Meta Quest headsets, so UX designers without access to Vision Pro hardware could still iterate on interactions.

This pipeline supported eight unique gesture mechanics and their associated gameplay. Throughout development, I guided a cross-functional team of engineers and designers in building on the system, and collaborated closely with writers and VFX artists to integrate powers into the narrative. Based on their feedback, I also developed custom C++ Unreal Engine Blueprint tools that significantly accelerated designer and VFX workflows.

SIGGRAPH Paper

PROCESS:

Initial Prototyping and Evaluation on RealityKit

When development began, the Vision Pro was brand-new, untested, and, if the engineer in me is being honest, poorly documented technology. Our goal was to create a deeply immersive experience with engaging superpowers, which required an in-depth understanding of the headset's capabilities. I helped assess its AR/VR features, hand tracking performance, and visual fidelity so we could determine what would be not just feasible but enjoyable. During this process, I developed a series of prototypes using ARKit and RealityKit to test our ideas. These prototypes ultimately informed our final technology stack and gameplay concepts.

ARKit Data & Headless Unreal Engine System

Our initial evaluation revealed that while the Vision Pro's hand tracking and display hardware were exceptional, the native RealityKit engine lacked the advanced systems required to achieve our gameplay goals. To leverage a robust game development environment, we adopted a hybrid, headless Unreal Engine 5 (UE5) architecture. This decoupled framework ran all gameplay logic, physics, and state simulation on the UE5 backend while utilizing RealityKit and Metal for front-end spatial rendering.

Because our player abilities were triggered by physical gestures, the systems required real-time access to the Vision Pro’s advanced scene understanding, including player gaze vectors, spatial context, and high-frequency hand data. However, exposing this data to our simulation backend was a major hurdle, as UE5’s iOS infrastructure did not yet support the Vision Pro ecosystem. To bridge this gap, I engineered a custom pipeline that streamed critical data directly from ARKit into Unreal Engine. This required modifying core, low-level iOS UE5 engine plugins to support non-standard, high-bandwidth data channels, and translating the data into a format native to Unreal's gameplay systems.
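As a rough illustration of the "translate into a native format" step, the sketch below converts a position from ARKit's world space (right-handed, +Y up, meters) into Unreal's convention (left-handed, +Z up, centimeters). The struct names, the function name, and the exact axis mapping are assumptions for illustration. The plain structs stand in for simd_float3 and FVector, and the project's actual conversion code is not described in this write-up.

```cpp
#include <cassert>

// Hypothetical stand-ins for simd_float3 (ARKit side) and FVector (Unreal side).
struct ArkitVec3 { float x, y, z; };   // meters, right-handed, +Y up, -Z forward
struct UnrealVec3 { float x, y, z; };  // centimeters, left-handed, +Z up, +X forward

constexpr float kCentimetersPerMeter = 100.0f;

// Assumed axis mapping: ARKit forward (-Z) -> Unreal X, ARKit right (+X) -> Unreal Y,
// ARKit up (+Y) -> Unreal Z, then scale meters to centimeters.
UnrealVec3 ToUnrealPosition(const ArkitVec3& p, float worldScale = kCentimetersPerMeter) {
    assert(worldScale > 0.0f);
    return UnrealVec3{-p.z * worldScale, p.x * worldScale, p.y * worldScale};
}
```

A joint one meter to the right, two meters up, and three meters in front of the ARKit origin would land at (300, 100, 200) in Unreal units under this mapping.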

Creating the Real-Time Gesture Recognition Plugin

Using the raw hand-joint positional data, I architected a core gesture-recognition system and packaged it into a reusable C++ plugin. The system defined each static pose by the position and orientation of individual joints relative to their parent joints. For real-time recognition, the live joint stream was compared against a matrix of reference poses, scoring how closely each live joint matched its reference.
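The comparison against reference poses could look something like the sketch below. It is a simplified illustration and not the plugin's code: each joint is reduced to a single unit direction in its parent's frame, the distance metric is a mean angular error, and every name here (HandPose, PoseDistance, ClosestPose, kJointCount) is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

struct Vec3 { float x, y, z; };

inline Vec3 Normalize(const Vec3& v) {
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    assert(len > 0.0f);
    return Vec3{v.x / len, v.y / len, v.z / len};
}

// Angle in radians between two unit-length directions.
inline float AngleBetween(const Vec3& a, const Vec3& b) {
    const float d = a.x * b.x + a.y * b.y + a.z * b.z;
    return std::acos(std::clamp(d, -1.0f, 1.0f));
}

// Assumption: five joints (one per finger) keeps the example short.
constexpr std::size_t kJointCount = 5;

struct HandPose {
    std::string name;
    // Each joint's bone direction expressed in its parent joint's local frame.
    std::array<Vec3, kJointCount> boneDirs;
};

// Convenience helper for the example: every joint points the same way.
HandPose MakeUniformPose(const std::string& name, const Vec3& dir) {
    HandPose pose{name, {}};
    pose.boneDirs.fill(Normalize(dir));
    return pose;
}

// Mean angular error (radians) between a live pose and a reference pose.
float PoseDistance(const HandPose& live, const HandPose& reference) {
    float total = 0.0f;
    for (std::size_t i = 0; i < kJointCount; ++i) {
        total += AngleBetween(live.boneDirs[i], reference.boneDirs[i]);
    }
    return total / static_cast<float>(kJointCount);
}

// Index of the closest reference pose within maxMeanErrorRad, or -1 if none qualifies.
int ClosestPose(const HandPose& live, const std::vector<HandPose>& references,
                float maxMeanErrorRad) {
    int best = -1;
    float bestDistance = maxMeanErrorRad;
    for (std::size_t i = 0; i < references.size(); ++i) {
        const float d = PoseDistance(live, references[i]);
        if (d <= bestDistance) {
            bestDistance = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Working in parent-relative directions means the score does not change when the whole hand moves or rotates, which is why the pose definition is expressed relative to parent joints rather than in world space.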

I also designed the plugin to support contextual constraints, such as hand orientation relative to the player's head or the floor, and specific minimum/maximum distances between joints. To ensure robustness across diverse users, I implemented a data-filtering and tolerance layer. Based on QA feedback, this layer allowed us to ignore minor joint variations or accept a broad range of angles for a given joint. For example, we relaxed the required thumb flexion in the "fist" pose to prevent false negatives.
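One way to picture how per-joint tolerances and contextual constraints combine is the sketch below. The data layout and names (JointRule, DistanceRule, PoseDefinition, Matches) are hypothetical, joints are reduced to a single flexion angle each, and the orientation check uses a generic target direction that a caller could set to "toward the head" or "away from the floor".

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

inline float Distance(const Vec3& a, const Vec3& b) {
    const Vec3 d{a.x - b.x, a.y - b.y, a.z - b.z};
    return std::sqrt(Dot(d, d));
}

// Hypothetical per-joint rule: target flexion plus how far a user may deviate.
struct JointRule {
    float targetFlexionRad;
    float toleranceRad;
    bool ignored;  // true = this joint never rejects the pose
};

// Hypothetical min/max distance constraint between two tracked joints.
struct DistanceRule {
    std::size_t jointA;
    std::size_t jointB;
    float minMeters;
    float maxMeters;
};

struct PoseDefinition {
    std::vector<JointRule> joints;
    bool constrainPalm = false;
    Vec3 palmTarget{0.0f, 1.0f, 0.0f};  // unit direction the palm should face
    float palmMinDot = 0.7f;            // roughly a 45 degree cone around palmTarget
    std::vector<DistanceRule> distances;
};

struct HandSample {
    std::vector<float> flexionRad;
    std::vector<Vec3> jointPositions;
    Vec3 palmNormal{0.0f, 1.0f, 0.0f};  // unit length
};

bool Matches(const PoseDefinition& def, const HandSample& hand) {
    assert(def.joints.size() == hand.flexionRad.size());
    for (std::size_t i = 0; i < def.joints.size(); ++i) {
        const JointRule& rule = def.joints[i];
        if (rule.ignored) continue;
        if (std::fabs(hand.flexionRad[i] - rule.targetFlexionRad) > rule.toleranceRad) return false;
    }
    if (def.constrainPalm && Dot(hand.palmNormal, def.palmTarget) < def.palmMinDot) return false;
    for (const DistanceRule& rule : def.distances) {
        assert(rule.jointA < hand.jointPositions.size() && rule.jointB < hand.jointPositions.size());
        const float d = Distance(hand.jointPositions[rule.jointA], hand.jointPositions[rule.jointB]);
        if (d < rule.minMeters || d > rule.maxMeters) return false;
    }
    return true;
}
```

Under this layout, relaxing the thumb for a "fist" is a data change (a wider toleranceRad or ignored = true on the thumb rule) rather than a code change, which is the kind of tuning the tolerance layer is meant to make cheap.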

[Static Pose Definition: joint angles and proximity] ➔ [Contextual Constraints: head/floor orientation] ➔ [Tolerance Filtering: ignore minor variations] ➔ [Multi-Step Sequence Matrix: time-window transition]

Once individual poses were stabilized, I structured multi-step gestures as a deterministic sequence of states managed within a defined temporal transition window. This framework allowed us to smoothly translate simple poses into complex interaction triggers:

"Snap" (Static): Middle and thumb fingertips intersecting, with the palm oriented upward.
"Shield" (Static): A closed fist directly facing the player's headset display.
"Blast" (Multi-Step): An arm outstretched with an open palm, transitioning rapidly to a closed fist targeted at an actor.
"Spell-Drawing" (Multi-Step): Symmetrical pinching motions tracing a geometric sigil, completed by a high-velocity hand clap.

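A multi-step gesture such as "Blast" can be sketched as a small sequencer over recognized pose names. This is an illustrative sketch under stated assumptions, not the shipped implementation: the class name GestureSequence, the pose names, and the choice to measure the transition window from the last frame the previous pose was seen are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sequencer: fires once when the poses in `steps` are seen in order,
// with each transition happening inside `transitionWindowSeconds`.
class GestureSequence {
public:
    GestureSequence(std::vector<std::string> steps, double transitionWindowSeconds)
        : steps_(std::move(steps)), window_(transitionWindowSeconds) {
        assert(!steps_.empty());
    }

    // `pose` is the currently recognized static pose ("" if none).
    // Returns true on the frame the final step is reached.
    bool Update(const std::string& pose, double nowSeconds) {
        // Still holding the previous step: keep the window open.
        if (next_ > 0 && pose == steps_[next_ - 1]) {
            lastSeen_ = nowSeconds;
            return false;
        }
        // Too long since the previous step was last seen: start over.
        if (next_ > 0 && nowSeconds - lastSeen_ > window_) {
            next_ = 0;
        }
        if (pose == steps_[next_]) {
            lastSeen_ = nowSeconds;
            ++next_;
            if (next_ == steps_.size()) {
                next_ = 0;
                return true;
            }
        }
        return false;
    }

private:
    std::vector<std::string> steps_;
    double window_;
    double lastSeen_ = 0.0;
    std::size_t next_ = 0;  // index of the step we are waiting for
};
```

With steps {"OpenPalm", "Fist"} and a 0.5 second window, a player can hold the open palm for as long as they like, and the gesture fires only if the fist follows within half a second of the palm last being seen. Unrecognized frames inside the window do not reset the sequence.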
Iterative UX Design and QA Refinement

The modular architecture of the gesture plugin enabled extensive interaction prototyping, allowing us to quickly iterate on eight distinct player abilities to ensure they felt tactile and responsive. Close collaboration with our QA teams was essential to account for the wide variance in how different users naturally perform physical actions.

For the "snap" mechanic, we paired explicit in-experience tutorialization with highly relaxed tolerances for the non-essential fingers. A major spatial computing challenge arose when a user's fingers became occluded from the tracking cameras during high-velocity movements, such as the "blast" gesture. To solve this, I chose to prioritize state-machine flexibility over strict pose consistency. Our testing showed that maintaining continuous visual feedback was far less jarring to the player's immersion than the stuttering, drop-out effect caused by a temporary loss of tracking.
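The write-up does not specify the mechanism used to ride out tracking loss. One common approach, shown here purely as an assumption, is a grace period: while the hand is untracked, the last confirmed state is held for a short time instead of dropping immediately. The class name ActivationHold and the grace duration are hypothetical.

```cpp
#include <cassert>

// Hypothetical filter: keeps a power "active" through brief tracking loss.
class ActivationHold {
public:
    explicit ActivationHold(double graceSeconds) : grace_(graceSeconds) {
        assert(graceSeconds >= 0.0);
    }

    // handTracked: whether the tracker reported valid joints this frame.
    // poseMatched: pose result for this frame (only meaningful when tracked).
    bool Update(bool handTracked, bool poseMatched, double nowSeconds) {
        if (handTracked) {
            // Trust fresh data immediately, in both directions.
            active_ = poseMatched;
            if (active_) lastConfirmed_ = nowSeconds;
        } else if (active_ && nowSeconds - lastConfirmed_ > grace_) {
            // Tracking has been gone too long: let the effect end.
            active_ = false;
        }
        return active_;
    }

private:
    double grace_;
    double lastConfirmed_ = 0.0;
    bool active_ = false;
};
```

The trade-off is that the effect can persist for up to the grace period after the player has actually released the pose while occluded, which matches the stated preference for continuous feedback over strict consistency.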

VFX Integration and Performance Optimization

I partnered closely with our VFX artists to tie visual feedback directly to the gesture-recognition state machine, ensuring that power activation, charging states, and environmental impacts synchronized perfectly with player intent.

From an engineering perspective, latency was the critical performance bottleneck. Positional data had to traverse the entire hardware and software loop: captured via ARKit, streamed to the headless UE5 instance for gameplay simulation, and sent back across the OS layer to RealityKit for final rendering. To hit our performance targets, I ran multiple profiling and optimization passes on the joint-tracking and pose-evaluation loops, minimizing per-frame calculations so the interaction pipeline held a consistent 30+ fps.

Technical Retrospective and Accessibility

Developing a launch title on unreleased hardware required rapid technical adaptation across highly unfamiliar systems. Reflecting on the production architecture, a key area for future improvement is a more robust approach to physical accessibility. Due to strict timelines, we were unable to implement a dynamic, individual range-of-motion calibration sequence. Ideally, a calibration phase during the tutorial would prompt users to fully open and flex their hands, creating a normalized scalar for subsequent pose definitions.
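The calibration idea described above was not implemented in the project, so the following is only a sketch of how such a normalized scalar might be computed. It assumes each joint is summarized by one flexion angle, and the class and method names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-user calibration: records each joint's observed flexion range
// while the user opens and closes their hand, then maps live values to [0, 1].
class RangeOfMotionCalibration {
public:
    explicit RangeOfMotionCalibration(std::size_t jointCount)
        : minRad_(jointCount, std::numeric_limits<float>::max()),
          maxRad_(jointCount, std::numeric_limits<float>::lowest()) {}

    // Call once per frame during the calibration prompt.
    void Observe(const std::vector<float>& flexionRad) {
        assert(flexionRad.size() == minRad_.size());
        for (std::size_t i = 0; i < flexionRad.size(); ++i) {
            minRad_[i] = std::min(minRad_[i], flexionRad[i]);
            maxRad_[i] = std::max(maxRad_[i], flexionRad[i]);
        }
    }

    // 0 = this user's fully open, 1 = this user's fully flexed.
    // Returns 0 if the joint has no usable range yet.
    float Normalize(std::size_t joint, float flexionRad) const {
        assert(joint < minRad_.size());
        const float range = maxRad_[joint] - minRad_[joint];
        if (!(range > 0.0f)) return 0.0f;
        return std::clamp((flexionRad - minRad_[joint]) / range, 0.0f, 1.0f);
    }

private:
    std::vector<float> minRad_;
    std::vector<float> maxRad_;
};
```

Pose definitions could then be written against the normalized value (for example "index finger above 0.8") so that a user with limited flexion still reaches the threshold relative to their own range.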

To mitigate this limitation, I implemented an alternative accessibility mode that registered a simplified, low-strain outstretched hand pose as a generic trigger for all gesture interactions. Ultimately, this project reinforced the vital importance of building flexible developer tools; our iterative passes on the internal C++ Blueprint interfaces significantly lowered the technical barrier for our creative team, enabling designers to tune and balance the experience independently.