Project Overview

Our project uses AI/ML tools to assist individuals with hearing and speech impairments in everyday conversation. With so many capable models and toolkits available, we wanted to create something meaningful and helpful for the community. Our proposed system includes three stages: speech-to-text conversion, gesture/pose recognition, and text-to-speech conversion. In the end, we integrated the three separate components into a single web page.

Product Design | Web | Machine Learning

Gest-o-talk: building a web application using ML tools to facilitate conversation with hearing-impaired communities

Team

Design team of three

My Role

System design, User Journey Mapping, Pose Classification (PoseNet) using Teachable Machine, Web Design and Integration

Timeline

March - April 2023, 3 weeks

Tools

Web Speech API, PoseNet with Teachable Machine, HTML, CSS, JavaScript

Motivation

Having a friend with hearing and speech difficulties opened our eyes to the challenges that people with these conditions face every day. Our friend often struggles to communicate effectively, especially in noisy environments, and has expressed frustration at feeling left out of conversations and social events.

Problem Identified

1. Difficulty in Speech Perception

2. Misunderstanding

3. Social Isolation

Ideation & Co-Design

Problem Space -> User Needs -> Design System Breakdown

Why Use Machine Learning Tools?

AI/ML tools can improve communication between individuals with hearing difficulties and those without. By providing real-time transcription, gesture interpretation, and text-to-speech output, these tools help bridge the gap between people with different communication abilities, making it easier for everyone to connect.

Some Explorations

1. Speech-to-text

Whisper is an ML model made by OpenAI that transcribes speech audio to text and can also translate speech into English text. It is implemented mostly in PyTorch.

However, we ran into challenges when integrating this model into a web-based solution. Whisper is distributed as a Python package, and we did not find a browser-ready API for it as we did for other models, which made it difficult to integrate into our web page. Additionally, the 25 MB input size limit we encountered is a problem for long audio streams.

2. Pose Classification

For the second step, we have selected PoseNet in combination with Teachable Machine to explore the challenging problem of gesture and pose recognition. Our objective is to develop a model that can accurately interpret gestures and translate them into text or audio format, thus aiding communication between individuals with hearing and speech impairments and those who are not familiar with sign language.
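As a rough illustration of how classified poses could be turned into preset text, the sketch below picks the most probable class and maps it to a reply. It is not our exact implementation: the class names, the reply texts, and the 0.8 confidence threshold are hypothetical. It assumes `model` is a pose model exported from Teachable Machine and loaded in the browser with the Teachable Machine pose library (`tmPose.load(modelURL, metadataURL)`), whose `predict` call returns objects with `className` and `probability` fields.

```javascript
// Hypothetical mapping from trained pose classes to preset replies.
const PRESET_RESPONSES = {
  thumbs_up: "Yes",
  wave: "Hello",
  arms_crossed: "No",
};

// Return the preset reply for the most probable class, or null when the
// model is not confident enough or the class has no preset reply.
function pickResponse(predictions, presets, threshold = 0.8) {
  let best = null;
  for (const p of predictions) {
    if (best === null || p.probability > best.probability) best = p;
  }
  if (best === null || best.probability < threshold) return null;
  const reply = presets[best.className];
  return reply === undefined ? null : reply;
}

// Classify one webcam frame. `model` is assumed to expose estimatePose()
// and predict() as in the Teachable Machine pose library.
async function classifyFrame(model, canvas, presets = PRESET_RESPONSES) {
  const { posenetOutput } = await model.estimatePose(canvas);
  const predictions = await model.predict(posenetOutput);
  return pickResponse(predictions, presets);
}
```

Returning `null` below the threshold keeps the page from speaking a reply when the pose is ambiguous, which matters because a wrong preset response is worse than no response in a live conversation.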

3. Web Speech API

The Web Speech API is a browser API for speech recognition and speech synthesis. Given the integration challenges we encountered with Whisper, we decided to use the Web Speech API for both speech-to-text and text-to-speech.
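A minimal sketch of both directions with the Web Speech API is shown below. The function names and the `root` parameter are our own illustration (in a page, `root` would be `window`); passing it in keeps the functions usable where the API is missing. Speech recognition support varies by browser and is often exposed under the `webkit` prefix, so the sketch checks for both names.

```javascript
// Find the speech recognition constructor, prefixed or not.
function getRecognitionCtor(root) {
  return root.SpeechRecognition || root.webkitSpeechRecognition || null;
}

// Speech-to-text: start listening and pass each new transcript to onText.
// Returns the recognition object, or null if the browser lacks support.
function startTranscription(root, onText) {
  const Recognition = getRecognitionCtor(root);
  if (!Recognition) return null;
  const recognition = new Recognition();
  recognition.continuous = true;      // keep listening across pauses
  recognition.interimResults = true;  // report partial results while speaking
  recognition.lang = "en-US";         // assumed language for this sketch
  recognition.onresult = (event) => {
    let text = "";
    for (let i = event.resultIndex; i < event.results.length; i++) {
      text += event.results[i][0].transcript;
    }
    onText(text);
  };
  recognition.start();
  return recognition;
}

// Text-to-speech: speak the given text aloud. Returns false if unsupported.
function speak(root, text) {
  if (!root.speechSynthesis || !root.SpeechSynthesisUtterance) return false;
  const utterance = new root.SpeechSynthesisUtterance(text);
  root.speechSynthesis.speak(utterance);
  return true;
}

// In a page (element id is hypothetical):
// startTranscription(window, (t) => {
//   document.getElementById("transcript").textContent = t;
// });
```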

Prototyping

Wireframe

Our proposed system is a three-step process. It envisions a conversation between two people: one with typical hearing and speech (user 1) and one with hearing loss and a speech impairment (user 2).

Step 1 captures user 1's speech and transcribes the audio into on-screen text that user 2 can read.
Step 2 uses the camera to capture user 2's gesture/pose and classifies it into a preset response.
Step 3 uses text-to-speech conversion to turn either user 2's directly typed text or the classified preset response into audio output for user 1.
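The three steps above can be sketched as a small controller that connects the pieces. This is an illustrative outline, not the code behind our demo: `createConversation`, `showText`, and `speak` are hypothetical names, and giving typed text priority over a classified pose is an assumption made for this sketch.

```javascript
// Wire the three steps together. The caller supplies:
//   showText(text): display a transcript on the page (Step 1 output)
//   speak(text):    play text as audio (Step 3 output)
function createConversation({ showText, speak }) {
  return {
    // Step 1: user 1's speech has been transcribed; show it to user 2.
    onSpeechTranscribed(transcript) {
      showText(transcript);
    },

    // Steps 2 and 3: user 2 replies by typing or by a classified pose.
    // Typed text wins when both are present (an assumption of this sketch).
    onReply({ typedText = "", classifiedResponse = null }) {
      const typed = typedText.trim();
      const reply = typed || classifiedResponse;
      if (!reply) return null; // nothing to say
      speak(reply);
      return reply;
    },
  };
}
```

Injecting `showText` and `speak` keeps the flow independent of any specific speech or pose library, so each stage can be swapped without touching the others.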

Building

Teachable Machine & PoseNet

Demo of an integrated web service

Future Development

Our system design for assisting individuals with hearing impairment during conversations has shown promising results, but there is still room for improvement.

To increase accuracy and usability, additional features such as sentiment analysis and a feedback system could offer more nuanced responses and improve performance over time. To expand the system's functionality, we could explore alternative inputs, such as classification for supportive conversation, and communication visualization using generative AI. Sign language detection for real-time transcription is another area worth pursuing.

The current design has limitations, such as requiring users to open a web application during a conversation. Even so, the potential benefits of AI/ML tools for people with hearing difficulties are significant. With models fine-tuned on accurate data and ethical implications addressed, these tools can help solve real-world problems.

Credit to my team members:
Adit Verma, Ellie Huang and Jiamin Liu;
and the support from DF OCAD faculty and resources.
