Our project uses AI/ML tools to assist individuals with hearing and speech impairments in everyday conversation. With so many capable models and toolkits available, we wanted to create something meaningful and helpful for the community. Our proposed system includes three stages: speech-to-text conversion, gesture/pose recognition, and text-to-speech conversion. In the end, we integrated the three separate components into a single web page.
Product Design | Web | Machine Learning
Gest-o-talk: building a web application using ML tools to facilitate conversation with hearing-impaired communities
Design team of three
System design, User Journey Mapping, Pose Classification (PoseNet) using Teachable Machine, Web Design and Integration
March - April 2023, 3 weeks
Web Speech API, PoseNet with Teachable Machine, HTML, CSS, JavaScript
Having a friend with hearing and speech difficulties has opened our eyes to the many challenges that people in this situation face on a daily basis. Our friend often struggles to communicate effectively with others, especially in noisy environments, and has expressed frustration at feeling left out of conversations and social events.
Difficulty in Speech Perception
Ideation & Co-Design
Problem Space -> User Needs -> Design System Breakdown
Why Use Machine Learning Tools?
These AI/ML tools can be incredibly helpful in improving communication and interaction between individuals with hearing difficulties and those without. By providing real-time transcription, gesture interpretation, and conversion of text replies to audio, these tools can help bridge the gap between individuals with different communication abilities, making it easier for everyone to connect and communicate effectively.
Whisper is an ML model that provides speech translation as well as transcription of live speech data to text. It was developed by OpenAI and is implemented mostly in PyTorch.
However, we encountered some challenges in integrating this model into a web-based solution for our project. Whisper is distributed as a Python package and, unlike some other models, does not offer a browser-ready API, which made it difficult to integrate into our client-side solution. Additionally, the hosted version of the model accepts audio uploads of at most 25 MB per file, which is a challenge for long audio streams.
2. Pose Classification
For the second step, we have selected PoseNet in combination with Teachable Machine to explore the challenging problem of gesture and pose recognition. Our objective is to develop a model that can accurately interpret gestures and translate them into text or audio format, thus aiding communication between individuals with hearing and speech impairments and those who are not familiar with sign language.
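A minimal sketch of how a pose model exported from Teachable Machine could be queried in the browser. It assumes the `@teachablemachine/pose` library is loaded as the global `tmPose` and that `model` was obtained from `tmPose.load(modelURL, metadataURL)`. The helper names `pickTopPose` and `classifyFrame` and the 0.8 confidence threshold are illustrative assumptions, not values taken from our project.

```javascript
// Pick the most probable class from a Teachable Machine prediction array.
// Each entry has the shape { className: string, probability: number }.
// Returns null when no class reaches the (assumed) confidence threshold.
function pickTopPose(predictions, threshold = 0.8) {
  let best = null;
  for (const p of predictions) {
    if (best === null || p.probability > best.probability) best = p;
  }
  return best !== null && best.probability >= threshold ? best.className : null;
}

// Classify a single webcam frame.
// `model` is assumed to come from tmPose.load(modelURL, metadataURL);
// `canvas` is the canvas that the webcam frames are drawn to.
async function classifyFrame(model, canvas, threshold = 0.8) {
  // PoseNet first extracts keypoints, then the trained classifier scores them.
  const { posenetOutput } = await model.estimatePose(canvas);
  const predictions = await model.predict(posenetOutput);
  return pickTopPose(predictions, threshold);
}
```

Returning `null` below the threshold is one way to avoid voicing a response when the classifier is unsure, which matters when a misread gesture would put the wrong words in someone's mouth.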
3. Web Speech API
Web Speech API is a browser API designed for speech recognition and synthesis. Given the challenges we encountered with Whisper, we decided to use the Web Speech API for both speech-to-text and text-to-speech conversion.
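The two halves of the Web Speech API can be wrapped roughly as follows. This is a sketch rather than our exact code: the function names `startTranscription` and `speak`, the `scope` parameter (normally `window`), and the `en-US` default are assumptions made for illustration.

```javascript
// Start continuous speech recognition and pass each final transcript to onText.
// `scope` is the object exposing the Web Speech API (window in a browser).
function startTranscription(scope, onText, lang = "en-US") {
  // Chromium-based browsers expose the constructor with a webkit prefix.
  const Recognition = scope.SpeechRecognition || scope.webkitSpeechRecognition;
  if (!Recognition) return null; // API not available in this browser

  const recognizer = new Recognition();
  recognizer.lang = lang;
  recognizer.continuous = true;      // keep listening across pauses
  recognizer.interimResults = false; // only report finalized phrases
  recognizer.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      if (event.results[i].isFinal) {
        onText(event.results[i][0].transcript.trim());
      }
    }
  };
  recognizer.start();
  return recognizer;
}

// Speak a piece of text aloud using the speech synthesis half of the API.
// Returns false when synthesis is unavailable or there is nothing to say.
function speak(scope, text, lang = "en-US") {
  if (!scope.speechSynthesis || !text) return false;
  const utterance = new scope.SpeechSynthesisUtterance(text);
  utterance.lang = lang;
  scope.speechSynthesis.speak(utterance);
  return true;
}
```

In a page this might be called as `startTranscription(window, (text) => { captions.textContent = text; })` and `speak(window, "Hello")`, where `captions` is a hypothetical element that displays the transcript.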
Our proposed system is a three-step process. It envisions a conversation between two people: user 1, who has typical hearing and speech, and user 2, who has hearing loss and a speech impairment.
Step 1 captures user 1's speech and transcribes the audio input into visual text that user 2 can read.
Step 2 uses the camera to capture user 2's gesture/pose and classifies it into a preset response.
Step 3 uses text-to-speech conversion to turn either user 2's direct text input or the classified preset response into audio output for user 1.
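The choice made in step 3 between user 2's direct text and the classified pose can be sketched as a small lookup. The preset labels and sentences below are hypothetical placeholders, and the sketch assumes that "direct" input means text typed by user 2, which takes priority over a detected pose.

```javascript
// Hypothetical preset responses keyed by pose class name (labels are illustrative).
const PRESET_RESPONSES = {
  ThumbsUp: "Yes, that works for me.",
  Wave: "Hello!",
  HandRaised: "Could you please repeat that?",
};

// Decide what step 3 should speak back to user 1.
// Prefers text entered directly by user 2; otherwise falls back to the
// preset response for the classified pose. Returns null if there is neither.
function chooseReply(typedText, poseLabel, presets = PRESET_RESPONSES) {
  const typed = (typedText || "").trim();
  if (typed) return typed;
  if (poseLabel && Object.prototype.hasOwnProperty.call(presets, poseLabel)) {
    return presets[poseLabel];
  }
  return null; // nothing to speak
}
```

The returned string, when not `null`, is what would be handed to the text-to-speech stage.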
Teachable Machine & PoseNet
Demo of an integrated web service
Our system design for assisting individuals with hearing impairment during conversations using AI/ML tools has shown promising results, but there is still room for improvement.
To increase the accuracy and usability of the system, incorporating additional features such as sentiment analysis and a feedback system could offer more nuanced responses and continuously improve performance. Furthermore, to expand the functionality of the system, alternative inputs such as classification for supportive conversation and communication visualization using generative AI could be explored. In addition, sign language detection for real-time transcription and communication is an area of research that could be pursued further.
While the proposed system design has limitations, such as the need for users to open a web application during conversations, the potential benefits of integrating AI/ML tools to assist individuals with hearing problems are significant. By fine-tuning models on accurate data and addressing ethical implications, AI and machine learning tools can help solve real-world problems.
Credit to my team members:
Adit Verma, Ellie Huang and Jiamin Liu;
and the support from DF OCAD faculty and resources.