An efficient activity recognition model by integrating object recognition and image captioning with deep learning techniques for the visually impaired



KİLİMCİ Z. H., KÜÇÜKMANİSA A.

Journal of the Faculty of Engineering and Architecture of Gazi University, vol.39, no.4, pp.2177-2186, 2024 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 39 Issue: 4
  • Publication Date: 2024
  • DOI: 10.17341/gazimmfd.1245400
  • Journal Name: Journal of the Faculty of Engineering and Architecture of Gazi University
  • Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Art Source, Compendex, TR DİZİN (ULAKBİM)
  • Page Numbers: pp.2177-2186
  • Keywords: Activity recognition, deep learning models, feature injection techniques, image caption generator, long short-term memory networks
  • Kocaeli University Affiliated: Yes

Abstract

Automatically identifying the content of an image is a core task in artificial intelligence that connects computer vision and natural language processing. This study presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation to produce natural sentences describing an image. With this model, the text obtained from an image can be converted into an audio file, so that the activity of the objects around a visually impaired person can be described to them. For this purpose, object recognition is first performed on the images with the YOLO model, which identifies the presence, location, and type of one or more objects in a given image. Next, long short-term memory (LSTM) networks are trained to maximize the probability of the target description sentence given the training image. In this way, the activities in the corresponding image are converted into text as annotations. The text describing the activity is then passed to the Google text-to-speech platform to produce an audio file of the activity description. The Flickr8K, Flickr30K, and MSCOCO datasets are used to evaluate four different feature injection architectures and demonstrate the effectiveness of the proposed model. The experimental results show that the proposed model successfully conveys the activity description audibly for visually impaired individuals.
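
As a rough illustration of the pipeline described in the abstract, the sketch below chains a pretrained YOLO detector, a merge-style Keras captioning model (standing in for one of the four feature injection variants evaluated in the paper), and the gTTS library for the text-to-speech step. The package choices, weight file name (yolov8n.pt), vocabulary size, and feature dimensions are assumptions made for illustration; this is not the authors' implementation.

```python
# Minimal sketch of the described pipeline, under assumed packages and sizes.
from ultralytics import YOLO                      # assumed object detector package
from gtts import gTTS                             # Google text-to-speech wrapper
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # assumed caption vocabulary size after tokenization
MAX_LEN = 34        # assumed maximum caption length in tokens
FEAT_DIM = 2048     # assumed dimensionality of the image feature vector


def build_caption_model():
    """Merge-style captioning model: the image features and the partial
    caption are encoded separately and combined before the word softmax."""
    img_in = Input(shape=(FEAT_DIM,))
    img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

    seq_in = Input(shape=(MAX_LEN,))
    seq_emb = Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
    seq_vec = LSTM(256)(Dropout(0.5)(seq_emb))

    merged = add([img_vec, seq_vec])              # combine the two modalities
    hidden = Dense(256, activation="relu")(merged)
    out = Dense(VOCAB_SIZE, activation="softmax")(hidden)

    model = Model(inputs=[img_in, seq_in], outputs=out)
    # Training maximizes the probability of the next caption word given the
    # image, i.e. cross-entropy over the target description sentence.
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model


def describe_image(image_path):
    # 1) Object detection with a pretrained YOLO model (assumed weights).
    detections = YOLO("yolov8n.pt")(image_path)

    # 2) The trained LSTM decoder would generate the caption here;
    #    a fixed placeholder sentence is used in this sketch.
    caption = "a person is riding a bicycle on the street"

    # 3) Convert the generated description to speech for the listener.
    gTTS(text=caption, lang="en").save("activity_description.mp3")
    return detections, caption


if __name__ == "__main__":
    build_caption_model().summary()
    describe_image("example.jpg")
```

In the merge design shown here the image vector is kept outside the recurrent unit; the injection variants differ in whether and when the image features are fed into the LSTM itself (e.g. as its initial state or alongside each word), which is the design axis the paper compares across the three datasets.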