An efficient activity recognition model by integrating object recognition and image captioning with deep learning techniques for the visually impaired


Creative Commons License

KİLİMCİ Z. H., KÜÇÜKMANİSA A.

JOURNAL OF THE FACULTY OF ENGINEERING AND ARCHITECTURE OF GAZI UNIVERSITY, vol.39, no.4, pp.2177-2186, 2024 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 39 Issue: 4
  • Publication Date: 2024
  • DOI: 10.17341/gazimmfd.1245400
  • Journal Name: JOURNAL OF THE FACULTY OF ENGINEERING AND ARCHITECTURE OF GAZI UNIVERSITY
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Art Source, Compendex, TR DİZİN (ULAKBİM)
  • Pages: pp.2177-2186
  • Keywords: Activity recognition, deep learning models, feature injection techniques, image caption generator, long short-term memory networks
  • Affiliated with Kocaeli University: Yes

Abstract

Automatically identifying the content of an image is a core task in artificial intelligence that connects computer vision and natural language processing. This study presents a generative model based on a deep, recurrent architecture, combining recent developments in computer vision and machine translation to create natural sentences describing an image. With this model, the texts obtained from images can be converted into an audio file, so that the activity of the objects around a visually impaired person can be described. For this purpose, object recognition is first performed on images with the YOLO model, which identifies the presence, location, and type of one or more objects in a given image. Next, long short-term memory (LSTM) networks are trained to maximize the probability of the target description sentence given the training image. Thus, the activities in the related image are converted to text as annotations. The activity descriptions in text format are then synthesized with the Google text-to-speech platform, producing an audio file that describes the activity. The Flickr8K, Flickr30K, and MSCOCO datasets are employed to evaluate four different feature-injection architectures and demonstrate the effectiveness of the proposed model. The experimental results show that the proposed model successfully expresses the activity description audibly for visually impaired individuals.
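The three-stage pipeline described above (object detection, caption generation, text-to-speech) can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: all function bodies are hypothetical stubs standing in for a YOLO forward pass, the LSTM caption decoder, and a text-to-speech call, and the example labels and caption template are invented for demonstration.

```python
# Illustrative sketch of the pipeline in the abstract. Each stage is a stub:
# real systems would run YOLO, an LSTM decoder, and a TTS service here.

def detect_objects(image_path):
    # Stand-in for YOLO: returns (label, bounding_box) pairs found in the image.
    return [("person", (10, 20, 50, 80)), ("dog", (60, 30, 90, 70))]

def generate_caption(image_path, detections):
    # Stand-in for the LSTM decoder conditioned on injected image features;
    # here we just join the detected labels into a toy sentence.
    labels = [label for label, _ in detections]
    return "a " + " and a ".join(labels) + " are playing in the park"

def caption_to_speech_request(caption, lang="en"):
    # Stand-in for the text-to-speech step (the paper uses Google's TTS
    # platform); we only build a request payload instead of calling the API.
    return {"text": caption, "lang": lang, "format": "mp3"}

detections = detect_objects("frame.jpg")
caption = generate_caption("frame.jpg", detections)
request = caption_to_speech_request(caption)
print(request["text"])  # → a person and a dog are playing in the park
```

In the actual system, the caption decoder is trained on Flickr8K, Flickr30K, and MSCOCO, and the resulting text is sent to Google text-to-speech to produce the audio description.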