Multimodal Screening Framework

Three signals.
One fused read on wellbeing.

Voice, spoken words, and facial expressions often carry complementary information. This framework fuses three perception models into a single aligned representation to identify behavioral indicators.

// audio · text · face → fused signal

Not a diagnostic tool. This system is a research screening aid built on verified research datasets. It does not diagnose any medical condition and should never replace a formal evaluation by a certified healthcare professional.

01 — Live Evaluation

Process an input video.

Upload a sample video to see the pipeline's analysis broken down by modality.

🎥

Drop or choose a video file

Supported format: .mp4

02 — Core Concept

Three specialized vectors, one trained referee.

Each modality is encoded by its own isolated perception backbone. A cross-attention fusion block then weights the modalities dynamically, based on the features of each input.

VOICE

Speech Acoustic Features

Extracts tone, pitch, energy distribution, and temporal pacing from raw audio waveforms.

wav2vec2 Architecture

WORDS

Textual Feature Semantics

Processes speech transcripts to capture sentiment, linguistic markers, and sequence-level semantics.

Whisper ASR → MentalBERT

FACE

Visual Affect Mapping

Samples video frames over time to capture micro-expressions and facial configurations.

Vision Transformer (ViT)

fused by a trained cross-attention head
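
The fusion idea above can be illustrated with a minimal sketch. It is not the trained head itself. It assumes each modality has already been reduced to one same-sized vector, uses random placeholder vectors in place of real backbone outputs, and omits the learned query/key/value projections a trained head would have. The function name cross_attention_fuse and the "add context, then average" pooling are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_fuse(embeddings):
    """Fuse same-sized modality embeddings (hypothetical helper).

    Each modality queries the other modalities with scaled dot-product
    attention, adds the attended context to its own vector, and the
    updated vectors are averaged into one fused vector.
    """
    names = list(embeddings)
    dim = len(embeddings[names[0]])
    scale = math.sqrt(dim)
    weights = {}
    updated = []
    for query_name in names:
        query = embeddings[query_name]
        others = [n for n in names if n != query_name]
        scores = [dot(query, embeddings[n]) / scale for n in others]
        attn = softmax(scores)
        weights[query_name] = dict(zip(others, attn))
        context = [
            sum(w * embeddings[n][i] for w, n in zip(attn, others))
            for i in range(dim)
        ]
        updated.append([q + c for q, c in zip(query, context)])
    fused = [sum(column) / len(updated) for column in zip(*updated)]
    return fused, weights


# Placeholder inputs: random 256-d vectors stand in for backbone outputs.
rng = random.Random(0)
embeddings = {
    name: [rng.gauss(0.0, 1.0) for _ in range(256)]
    for name in ("voice", "words", "face")
}
fused, weights = cross_attention_fuse(embeddings)
print(len(fused))          # size of the fused vector
print(weights["voice"])    # how much "voice" attends to "words" and "face"
```

The attention weights are recomputed for every input, which is what lets the fusion lean on a different modality from one sample to the next instead of using fixed mixing coefficients.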

03 — Architecture Mapping

Pipeline Dimensions & Vector Flow

audio.wav → wav2vec2 backbone → 256-d embedding

audio.wav → Whisper transcription → MentalBERT encoder → 256-d embedding

video.mp4 → ViT frame encoder → 256-d embedding

⟶ cross-attention fusion head ⟶

Output classes: [Low Risk / Moderate Risk / High Risk]
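
As a rough illustration of the final step, the sketch below maps a fused 256-d vector to the three output classes with a single linear layer followed by a softmax. The layer shape, the random placeholder weights, and the classify helper are assumptions made for this example; the trained head's actual structure is not described here.

```python
import math
import random

LABELS = ("Low Risk", "Moderate Risk", "High Risk")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weight_rows, biases):
    """Hypothetical linear head: one weight row and one bias per class."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weight_rows, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], dict(zip(LABELS, probs))


# Placeholder values: an untrained head applied to a random fused vector.
rng = random.Random(0)
fused = [rng.gauss(0.0, 1.0) for _ in range(256)]
weight_rows = [[rng.gauss(0.0, 0.05) for _ in range(256)] for _ in LABELS]
biases = [0.0, 0.0, 0.0]

label, probs = classify(fused, weight_rows, biases)
print(label)
print(probs)
```

With untrained weights the predicted label carries no meaning; the point is only the shape of the computation, from a 256-d input to three class probabilities that sum to one.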

04 — System Metrics

Cross-Modality Benchmark Performance Report

Configuration	Accuracy	F1 (macro)
Audio Only Pipeline	0.69	0.66
Text Only Pipeline	0.46	0.43
Face Only Pipeline	0.61	0.59
Concat Fusion Matrix	0.78	0.76
Attention Fusion (Ours)	0.85	0.83
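
For readers unfamiliar with the two columns, the sketch below shows how accuracy and macro F1 are computed from predicted and true labels. The six labels are made up purely for illustration and are unrelated to the results in the table.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for this example only.
labels = ["low", "moderate", "high"]
y_true = ["low", "low", "moderate", "moderate", "high", "high"]
y_pred = ["low", "moderate", "moderate", "moderate", "high", "low"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, labels)
print(round(acc, 3))  # 0.667
print(round(f1, 3))   # 0.656
```

Macro F1 averages the per-class scores without weighting by class size, so a model that does poorly on a rare class is penalized even when its overall accuracy looks good.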
