eprintid: 20206
rev_number: 2
eprint_status: archive
userid: 1
dir: disk0/00/02/02/06
datestamp: 2024-06-04 14:19:56
lastmod: 2024-06-04 14:19:56
status_changed: 2024-06-04 14:16:51
type: article
metadata_visibility: show
creators_name: Kehkashan, T.
creators_name: Alsaeedi, A.
creators_name: Yafooz, W.M.S.
creators_name: Ismail, N.A.
creators_name: Al-Dhaqm, A.
title: Combinatorial Analysis of Deep Learning and Machine Learning Video Captioning Studies: A Systematic Literature Review
ispublished: pub
keywords: Extraction; Human computer interaction; Learning systems; Long short-term memory, Combinatorial analysis; Deep learning; Features extraction; Machine-learning; Performance evaluation metrics; Performances evaluation; Systematic; Systematic literature review; Video analysis, Feature extraction
note: cited By 0
abstract: Recent improvements in the area of video captioning have brought rapid revolutions in its methods and the performance of its models. Both machine learning (ML) and deep learning (DL) techniques are employed in this regard. However, there is a lack of work tracing the latest studies and their remarkable results. Although several studies employing ML and DL algorithms have been reviewed in other areas, there is no systematic review addressing the video captioning task. This study aims to examine, evaluate, and synthesize the primary studies into a thorough Systematic Literature Review (SLR) that provides a general overview of the methods used for video captioning. We performed the SLR to determine the research problems under which machine learning models were preferred over deep learning models and vice versa. Based on our search string, we collected a total of 1,656 studies from four electronic databases (Scopus, WoS, IEEE Xplore, and ACM), of which 162 published studies passed the selection criteria related to one primary and two secondary research questions after a systematic process. Moreover, insufficient data collection and inefficient comparison of results are common issues identified during the review process. We conclude that 2D/3D CNNs for video feature extraction, LSTMs for caption generation, the METEOR and BLEU performance evaluation metrics, and the MSVD dataset are the most frequently employed for video captioning. Our study is the first to compare the implementation of ML and DL algorithms in the video captioning area. Thus, our study will accelerate the critical assessment of the state of the art in other research fields of video analysis and human-computer interaction. © 2013 IEEE.
date: 2024
publisher: Institute of Electrical and Electronics Engineers Inc.
official_url: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85184008341&doi=10.1109%2fACCESS.2024.3357980&partnerID=40&md5=0131e17ffc13fbfedb715b273e359da6
id_number: 10.1109/ACCESS.2024.3357980
full_text_status: none
publication: IEEE Access
volume: 12
pagerange: 35048-35080
refereed: TRUE
issn: 21693536
citation: Kehkashan, T. and Alsaeedi, A. and Yafooz, W.M.S. and Ismail, N.A. and Al-Dhaqm, A. (2024) Combinatorial Analysis of Deep Learning and Machine Learning Video Captioning Studies: A Systematic Literature Review. IEEE Access, 12. pp. 35048-35080. ISSN 2169-3536