Authors:
Edwin Shalom Soji, S. Silvia Priscila, B. M. Praveen
Addresses:
Department of Computer Science, Bharath Institute of Higher Education and Research, Chennai, Tamil Nadu, India.
Institute of Engineering and Technology, Srinivas University, Dakshina Kannada, Karnataka, India.
The study proposes a new model of bidirectional sign language processing that addresses the gap between recognizing human gestures and producing synthetic signs. By training a multimodal Vision-Language Transformer with Diffusion and adding Spatio-Temporal Graph Attention, the system captures detailed hand movements and facial expressions over time and space. It is built on a Cross-Lingual Semantic Alignment architecture to ensure that the subtle grammar of sign language is properly mapped onto the structures of spoken language. The dataset used in this study consists of 481 instances, recorded with different signers and lighting conditions to ensure robustness. The main development tools are advanced deep learning libraries for tensor manipulation, skeletal models based on graph neural networks, and high-fidelity video synthesis using diffusion probabilistic models. Findings indicate that the model performs well both in sign-to-text translation and in synthesizing realistic sign-language videos from textual data. This unified solution combines recognition and generation in a single framework, enabling inclusive communication without requiring distinct models. Graph attention focuses on small finger movements, while diffusion temporally smooths the generated sequences, improving the realism of the synthesized video.
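The graph-attention idea mentioned above restricts attention to joints that are actually connected in the hand skeleton, which is what lets the model focus on small finger movements. The following is a minimal, illustrative sketch of a single spatial attention head, written in plain NumPy rather than the deep learning libraries the authors used; the six-joint toy hand graph, variable names, and shapes are all assumptions for illustration, not the paper's actual architecture:

```python
import numpy as np

def graph_attention(x, adj, W, a):
    """One head of graph attention over skeletal keypoints (illustrative sketch).

    x   : (N, d_in)     keypoint features (e.g. 3-D joint coordinates)
    adj : (N, N)        skeleton adjacency (1 where a bone connects two joints)
    W   : (d_in, d_out) shared linear projection
    a   : (2 * d_out,)  attention scoring vector
    """
    h = x @ W                                    # project joint features
    N = h.shape[0]
    # Pairwise logits e_ij = LeakyReLU(a^T [h_i || h_j])
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            e[i, j] = np.concatenate([h[i], h[j]]) @ a
    e = np.where(e > 0, e, 0.2 * e)              # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)            # attend only along skeleton edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # row-wise softmax
    return alpha @ h                             # aggregate neighbour features

# Toy hand skeleton: a wrist node (0) connected to five fingertip nodes (1-5)
rng = np.random.default_rng(0)
N, d_in, d_out = 6, 3, 8
adj = np.eye(N)
adj[0, 1:] = adj[1:, 0] = 1.0                    # wrist-to-fingertip bones

out = graph_attention(rng.standard_normal((N, d_in)), adj,
                      rng.standard_normal((d_in, d_out)),
                      rng.standard_normal(2 * d_out))
print(out.shape)  # (6, 8)
```

In this toy graph, each fingertip attends only to itself and the wrist, so its output feature is a weighted mix of just those two joints; a spatio-temporal variant would additionally stack such layers across video frames.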
Keywords: Sign Language Translation; Diffusion Models; Graph Attention Networks; Multimodal Transformers; Generative Synthesis; Vision-Language Transformer; Textual Data.
Received on: 20/03/2025, Revised on: 29/05/2025, Accepted on: 05/08/2025, Published on: 03/01/2026
DOI: 10.69888/FTSIN.2026.000605
FMDB Transactions on Sustainable Intelligent Networks, 2026 Vol. 3 No. 1, Pages: 54–64