Journal of the Faculty of Engineering and Architecture of Gazi University, vol.39, no.4, pp.2177-2186, 2024 (SCI-Expanded)
Automatically identifying the content of an image is a core task in artificial intelligence that connects computer vision and natural language processing. This study presents a generative model based on a deep, recurrent architecture that combines recent developments in computer vision and machine translation to create natural sentences describing an image. With this model, the text obtained from an image can be converted into an audio file, so that the activity of the objects around a visually impaired person can be described to them. For this purpose, object recognition is first performed on the images with the YOLO model, which identifies the presence, location, and type of one or more objects in a given image. Next, long short-term memory (LSTM) networks are trained to maximize the probability of the target description sentence given the training image. In this way, the activities in the image are converted into textual annotations. These textual descriptions are then passed to the Google text-to-speech platform to produce an audio file describing the activity. The Flickr8K, Flickr30K, and MSCOCO datasets are employed to evaluate four different feature-injection architectures and demonstrate the effectiveness of the proposed model. The experimental results show that our proposed model successfully conveys activity descriptions audibly to visually impaired individuals.
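A minimal sketch of the detection-captioning-speech pipeline summarized above is given below. The package choices (the ultralytics YOLOv8 implementation for detection, Keras for the LSTM caption decoder, and the gTTS library for the Google text-to-speech step), the merge-style injection layout, and all layer sizes, file names, and vocabulary figures are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of the image -> text -> audio pipeline described in the abstract.
# Assumptions (not from the paper): ultralytics YOLOv8 for object detection,
# Keras for the caption decoder, gTTS for the Google text-to-speech step;
# model choice, layer sizes, vocabulary, and file names are illustrative.
from ultralytics import YOLO
from gtts import gTTS
from tensorflow.keras import layers, Model

# 1) Object recognition: presence, location, and type of objects in the image.
detector = YOLO("yolov8n.pt")                  # hypothetical model checkpoint
detections = detector("street_scene.jpg")      # hypothetical input image

# 2) Caption generator: one possible merge-style feature-injection
#    architecture, combining an image feature vector with an LSTM that
#    encodes the partial caption generated so far.
vocab_size, max_len, feat_dim = 5000, 34, 2048  # illustrative values

img_in = layers.Input(shape=(feat_dim,))        # pre-extracted image features
img_vec = layers.Dense(256, activation="relu")(img_in)

txt_in = layers.Input(shape=(max_len,))         # partial caption as word ids
emb = layers.Embedding(vocab_size, 256, mask_zero=True)(txt_in)
seq = layers.LSTM(256)(emb)

merged = layers.add([img_vec, seq])             # inject image features
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(vocab_size, activation="softmax")(hidden)

caption_model = Model(inputs=[img_in, txt_in], outputs=out)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
# Training maximizes the probability of the reference caption word by word.

# 3) Text-to-speech: convert the generated description to an audio file.
caption = "a person is riding a bicycle on the street"  # placeholder output
gTTS(text=caption, lang="en").save("description.mp3")
```

In this merge-style variant the image features are added to the LSTM output before the softmax; the paper's other injection variants differ in where the image vector enters the decoder (e.g., as the initial state or as a prefixed input token).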