Are you ready to break down language barriers and communicate effortlessly in American Sign Language (ASL)? Introducing One Mudra, your ultimate ASL translation companion. Developed by Mannat V Jain, a passionate 16-year-old student in the United States, One Mudra revolutionizes the way we interact with and understand ASL.
One Mudra is more than just a mobile application; it's a gateway to seamless communication. Harnessing the power of cutting-edge AI technology, One Mudra, with its bi-directional communication ability, bridges the gap between different sign languages and numerous global languages in real time. Open-sourced to trainers, it allows the app to be trained in various sign languages, expanding its reach and inclusivity. Whether you're a fluent sign language user or someone eager to learn, One Mudra empowers users to express themselves effectively and understand sign language conversations effortlessly.
One Mudra's AI feature automatically corrects grammatical errors, ensuring clear and accurate communication in real time.
Say goodbye to language barriers. One Mudra instantly translates ASL signs into various global languages, ensuring smooth communication in any setting.
From English and Spanish to Mandarin and French, One Mudra supports a wide range of languages, making it accessible to users worldwide.
A backend for training the model allows the larger community of sign language trainers to contribute, and lets users create personalized signs.
The world's first digitally enabled, bi-directional communication tool, converting both sign to text and text to sign.
Designed with simplicity in mind, One Mudra offers an intuitive interface for users of all ages and backgrounds. Navigate effortlessly and start translating with just a few taps.
Using One Mudra is as simple as pointing your device's camera at ASL signs. The app instantly detects and translates the signs into your preferred language, allowing for seamless communication in any situation. Whether you're having a conversation with a Deaf friend, attending an ASL class, or exploring a new culture, One Mudra is your trusted companion every step of the way.
Hearing loss is one of modern society’s most understated and overlooked clinical conditions. As per the WHO, over 430 million people around the world require rehabilitation to address “disabling hearing loss.” Another large segment of sign language users is ‘hearing nonverbal children.’ They are nonverbal due to conditions such as Down syndrome, autism, cerebral palsy, trauma, and brain or speech disorders.
The model itself is an LSTM (Long Short-Term Memory), a type of recurrent neural network. It uses Google’s MediaPipe to layer a landmark mesh onto the hands and face of the user, then tracks the mesh to determine motion and interprets the motion as text (which it learns to do from the training dataset). As a second step, it takes this converted text and runs a text-to-speech algorithm to say it aloud in the desired language.
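A minimal sketch of the landmark-to-sequence step of this pipeline, assuming MediaPipe-style (x, y, z) landmark arrays are already available per frame (the actual camera and MediaPipe calls are omitted; all function names here are illustrative, not the app's real API):

```python
import numpy as np

N_HAND = 21   # MediaPipe Hands tracks 21 landmarks per hand
N_FACE = 468  # MediaPipe Face Mesh tracks 468 landmarks

def frame_to_features(left_hand, right_hand, face):
    """Flatten one frame's (x, y, z) landmarks into a single feature vector.
    Missing parts (e.g., a hand out of view) are zero-filled so every frame
    has the same length, as the LSTM expects a fixed feature size."""
    parts = []
    for lm, n in ((left_hand, N_HAND), (right_hand, N_HAND), (face, N_FACE)):
        parts.append(np.zeros(n * 3) if lm is None else np.asarray(lm).reshape(-1))
    return np.concatenate(parts)

def frames_to_sequence(frames):
    """Stack per-frame feature vectors into a (timesteps, features) array,
    the input shape an LSTM consumes."""
    return np.stack([frame_to_features(*f) for f in frames])

# Simulated 30-frame clip (real data would come from the camera + MediaPipe).
rng = np.random.default_rng(0)
clip = [(rng.random((N_HAND, 3)), None, rng.random((N_FACE, 3))) for _ in range(30)]
seq = frames_to_sequence(clip)
print(seq.shape)  # (30, 1530): 30 frames x (21 + 21 + 468) * 3 features
```

The resulting sequence is what gets fed to the recurrent model; the text-to-speech stage would then run on the predicted label.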
I chose to use an LSTM model for my project, a type of RNN (Recurrent Neural Network), because of the comparatively little training data it needs. I’ve outlined the benefits of the RNN architecture in the table below. Plain RNNs struggle to retain information over long sequences (the vanishing-gradient problem); LSTMs solve this problem by introducing a memory cell, which can store information over a prolonged period. The cell has gates that control the flow of information into and out of the cell, allowing the network to selectively retain or forget information as needed. This is important for recognizing a vast library of gestures whose frequency of repetition is low.
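The memory cell and its gates can be made concrete with a single LSTM time step written out in NumPy. This is a textbook sketch of the standard LSTM equations, not the project's actual model code (which would use a deep-learning framework):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters for the
    forget (f), input (i), candidate (g), and output (o) gates."""
    z = W @ x + U @ h_prev + b            # shape (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates in (0, 1)
    g = np.tanh(g)                                  # candidate values
    c = f * c_prev + i * g   # memory cell: selectively forget, then write
    h = o * np.tanh(c)       # exposed hidden state
    return h, c

# Tiny demo: run a random 10-step sequence through the cell.
rng = np.random.default_rng(1)
n_in, n_hid = 8, 4
W = rng.standard_normal((4 * n_hid, n_in)) * 0.1
U = rng.standard_normal((4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.standard_normal((10, n_in)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The forget gate `f` multiplying `c_prev` is exactly the mechanism that lets the network retain information about a gesture's earlier frames over a prolonged period.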
The principal effort is to train the computer to recognize the spatial coordinates (x, y, z) of sign gestures. Although the gestures are intended to be standardized, recognition almost always encounters variations because each user signs slightly differently, much as handwriting varies from person to person.
To reduce these variations, the program is trained on each gesture of this new, unifying language with an increasing number of runs, until it reaches an acceptable level of accuracy.
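One common way to reduce such signer-to-signer variation, before any training, is to normalize the raw coordinates so that hand position and hand size cancel out. This is a generic preprocessing sketch under that assumption (the landmark indices follow MediaPipe's hand numbering; whether the project applies this exact step is an assumption):

```python
import numpy as np

def normalize_hand(landmarks):
    """Normalize 21 (x, y, z) hand landmarks: translate so the wrist
    (landmark 0) sits at the origin, then scale so the wrist-to-middle-
    fingertip distance equals 1. Two users making the same sign at
    different distances from the camera then yield similar coordinates."""
    pts = np.asarray(landmarks, dtype=float)
    pts = pts - pts[0]                  # wrist at origin
    scale = np.linalg.norm(pts[12])     # landmark 12: middle fingertip
    return pts / scale if scale > 0 else pts

# The same hand shape, shifted and rescaled, maps to the same coordinates.
rng = np.random.default_rng(2)
hand = rng.random((21, 3))
shifted_scaled = hand * 2.5 + np.array([0.3, -0.1, 0.8])
a, b = normalize_hand(hand), normalize_hand(shifted_scaled)
print(np.allclose(a, b))  # True
```

Normalization like this shrinks the variation the model must absorb, so fewer training runs per gesture are needed to reach a given accuracy.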
An extract of a sample run of my code is pasted below. These runs can extend to thousands of lines to decipher iconography organized as follows: deictic (location or time), motor (general gestures), symbolic (representational), iconic (understandable concepts), and metaphoric (conveyed through analogy).
By using machine learning techniques, it is possible to create a system that can recognize sign language with a high degree of accuracy and translate it into any spoken or written language in real time. This system would use a database of signs (organized around deictic, motor, symbolic, iconic, and metaphoric categories), and use machine learning algorithms to analyze and interpret the signs in real time. The system would then generate spoken or written text that represents the meaning of the signs being used, allowing people who use sign language to communicate with those who do not understand it.
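At inference time, such a real-time system typically slides a fixed-length window over the incoming frames and smooths the per-window predictions so a single noisy window does not flip the caption. The sketch below illustrates that loop with a stand-in classifier; the gesture labels and the `classify` function are placeholders, not the project's trained model:

```python
import numpy as np
from collections import Counter, deque

GESTURES = ["hello", "thanks", "emergency"]  # illustrative labels only

def classify(window):
    """Stand-in for the trained LSTM: a real system would call
    model.predict() on the (timesteps, features) window here."""
    return int(window.sum()) % len(GESTURES)

def translate_stream(frames, window_len=30, vote_len=5):
    """Slide a fixed-length window over the frame features and smooth
    the predictions with a majority vote over the last few windows,
    emitting a caption only when the stable label changes."""
    recent = deque(maxlen=vote_len)
    captions = []
    for t in range(window_len, len(frames) + 1):
        recent.append(classify(frames[t - window_len:t]))
        label = Counter(recent).most_common(1)[0][0]
        if not captions or captions[-1] != label:
            captions.append(label)
    return [GESTURES[i] for i in captions]

# Simulated 60-frame feature stream (real input: camera -> landmark features).
rng = np.random.default_rng(3)
stream = rng.random((60, 1530))
result = translate_stream(stream)
print(result)
```

The emitted labels are what a downstream text-to-speech stage would voice in the listener's chosen language.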
There is an optimal number of training epochs and runs-per-gesture for my translator system. (An epoch refers to a complete iteration through a dataset during training of a machine learning model.) My program functioned best at the dual equilibrium of 40 runs/gesture and 2000 epochs, suggesting that these values are near-optimal for such projects. Adding more data only results in longer runtimes while making no statistically significant enhancement in accuracy. (In some outlier results, an increase in runs/gesture actually reduced accuracy, an observation that needs greater scrutiny.)
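Finding that equilibrium amounts to stopping when extra epochs stop paying for themselves. The sketch below automates that decision on a synthetic saturating accuracy curve (the curve is illustrative only, not my measured data; thresholds are assumptions):

```python
import math

def synthetic_accuracy(epoch):
    """Synthetic stand-in for an observed accuracy curve: rises quickly,
    then plateaus near 97% (illustrative, not real measurements)."""
    return 0.97 * (1 - math.exp(-epoch / 600))

def find_plateau(accuracy_fn, step=100, min_gain=0.005, max_epochs=5000):
    """Stop adding epochs once another `step` epochs improves accuracy
    by less than `min_gain` -- i.e., once the curve has plateaued."""
    epoch = step
    while epoch + step <= max_epochs:
        gain = accuracy_fn(epoch + step) - accuracy_fn(epoch)
        if gain < min_gain:
            return epoch
        epoch += step
    return max_epochs

best = find_plateau(synthetic_accuracy)
print(best)  # plateaus near 2000 epochs on this synthetic curve
```

The same stop-on-plateau logic explains why pushing past 2000 epochs mostly buys runtime rather than accuracy.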
A “run” is defined as a single execution of the training process. I used 40 runs/gesture to train the machine and arrive at a 97% degree of accuracy. These are sample images of sequential runs. At 40 runs/gesture, the desired level of accuracy was observed; further runs provided only marginal gains (see graph above) once the curve began to plateau. The program can be trained with significantly more runs/gesture to achieve further gains. An increase in the number of gestures expands the program’s “vocabulary,” a measure of its “intelligence,” and hence its precision and utility.
This process is typically done using a specific algorithm and is often iterative, with the computer being trained over multiple rounds, or epochs, of the data similar to those displayed on the previous page. The quality and quantity of the data are crucial for the performance of the model, and it is also important to have a diverse and representative dataset. Both conditions will be met by the proposed plan to train the computer on 300 sign languages, each with an initial set of 100 gestures.
The principles of training a computer involve providing the computer with a large dataset of examples, along with corresponding outputs or labels. The computer then uses this data to learn patterns and relationships and adjust its internal parameters in order to improve its performance on new, unseen examples.
In the two graphs on the right, I have compared the accuracies of the various models, each of which was run with a different number of epochs. The 2000- and 3000-epoch models exhibit accuracies of 96% and 93% respectively, while the 1000-epoch model reaches only 76%.
The graph in green compares the latencies of the different models. This analysis shows that the 2000-epoch model is ideal: it takes 2 seconds, versus 3 seconds for the 1000-epoch model and 5 seconds for the 3000-epoch model.
I trained the LSTM on a custom-collected dataset of 900 videos (30 gestures × 30 recordings per gesture), each of which I recorded by hand. On this dataset I ran the training epochs (one epoch is one complete iteration over the training dataset, in which the system runs through the entire set of data, progressively updating its weights and biases), and the model returned a loss and a categorical-accuracy value for each epoch.
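Organizing those 900 clips for training means stacking them into the (samples, timesteps, features) layout an LSTM expects, with the gesture index as the label. A sketch of that assembly step, assuming 30 frames per clip and a hypothetical `load_clip` loader (the frame count and feature size are assumptions, not confirmed by the source):

```python
import numpy as np

N_GESTURES, N_RECORDINGS = 30, 30   # 30 gestures x 30 recordings = 900 clips
N_FRAMES, N_FEATURES = 30, 1530     # assumed per-clip frame count / feature size

def build_dataset(load_clip):
    """Assemble (X, y) in the (samples, timesteps, features) layout an
    LSTM trains on. `load_clip(g, r)` is a hypothetical loader returning
    recording r of gesture g as an (N_FRAMES, N_FEATURES) array."""
    X = np.empty((N_GESTURES * N_RECORDINGS, N_FRAMES, N_FEATURES))
    y = np.empty(N_GESTURES * N_RECORDINGS, dtype=int)
    for g in range(N_GESTURES):
        for r in range(N_RECORDINGS):
            X[g * N_RECORDINGS + r] = load_clip(g, r)
            y[g * N_RECORDINGS + r] = g   # label = gesture index
    return X, y

# Simulated loader standing in for reading recorded keypoint files from disk.
rng = np.random.default_rng(4)
X, y = build_dataset(lambda g, r: rng.random((N_FRAMES, N_FEATURES)))
print(X.shape, y.shape)  # (900, 30, 1530) (900,)
```

With labels one-hot encoded, per-epoch loss and categorical accuracy then fall out of any standard training loop over `(X, y)`.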
The Results
In this demonstration, I’m showing how a new shortcut can be used to convey a powerful message. By using a new gesture, one not rooted in any existing sign language, the shortcut can be adopted by anyone around the world who has a hearing disability. I coded one small gesture to quickly say: “This is an emergency.”
The result of the gesture recognition is displayed in the blue ribbon. (I’m showing closed captioning in English. But it could be in any language.) In the next stage, I can add a voice synthesizer which can convert this text into any spoken language from the thousands spoken around the world. All of this is possible on a smartphone app.
Epochs And Runs: Epochs and runs are related to the training process of machine learning models. An epoch is a complete iteration through all the training data, while a run is a single execution of the training process. Typically, a machine learning model is trained for multiple epochs, with each run consisting of multiple epochs. The number of epochs and runs can affect the performance of the model, with more epochs and runs leading to a better fit to the training data, but also increasing the risk of overfitting.
My project seeks to demonstrate how advanced AI capabilities can be leveraged to create a free, easy-to-use, low-latency sign language translator that can translate gestures in sign language to text and speech. Furthermore, the model ran locally with no privacy or security risks: the data was never stored or transmitted. With these features, the system is ready for implementation in everyday life.
The applications of such translative technology are near-limitless: running conversions between multiple different sign languages at will, adding closed captions to ASL videos for immediate understanding by global viewers, and more. Applications range from the most basic (e.g., ordering a pizza) to the critically important, such as describing an emergency or communicating effectively with a first responder, which can have life-and-death consequences.
The main technology can support the creation of various affiliated technology-enabled platforms, such as:
This technology can be used to make it easier for hearing-disabled and hearing nonverbal children to be educated.
This is probably one of the most important applications. Once a more natural conversation is possible between a hearing-disabled patient and a doctor, a more accurate and nuanced diagnosis will be possible.
This can exponentially expand the number of jobs and professions in which the hearing disabled can participate. This will have a direct consequence on their quality of life.
Create an App and launch this as a free service.
Nearly all alternative, existing systems built for object detection run in the cloud, i.e., on systems controlled by external companies, where data breaches are common and privacy is lost. To remedy this, my entire system runs on the user’s device, ensuring that there can be no violations of user privacy, as data is neither transmitted nor stored. Furthermore:
It will be possible to enable end-to-end encryption for all such conversations, similar to WhatsApp or Signal messaging, thus ensuring security.
Images and videos that are generated as part of any conversation are neither recorded, nor transmitted.
It will be possible to enable in-goggle translation of sign language gestures without the need for a laptop or external webcam.
Retinal movement detection can account for differences in visual acuity among recipients of the translated communication.
I am Mannat Jain, a student at Garden City High School in New York. I developed One Mudra, a sign language translation companion. I am creating the world’s first AI-based digital translator to enable hearing and speech-impaired individuals to communicate with ease.
Mannat Jain won Highest Honors, Most Distinguished Categorical Project in Physics, the LISTMELA Award of Excellence in STEM, and was a NY State Science Congress Finalist for his project, “Using Machine Learning to Translate Sign Language Gestures to Text and Speech.”
Ready to experience the power of seamless ASL translation? Join the One Mudra community today and embark on a journey of communication without boundaries. Download One Mudra from the App Store or Google Play Store and unlock a world of possibilities.
Join Now
Have questions, feedback, or suggestions? We'd love to hear from you! Connect with us on social media or reach out via email to share your thoughts and experiences with One Mudra. Break down barriers, connect with confidence, and discover the power of One Mudra today.