eprintid: 13652
rev_number: 2
eprint_status: archive
userid: 1
dir: disk0/00/01/36/52
datestamp: 2023-11-10 03:28:13
lastmod: 2023-11-10 03:28:13
status_changed: 2023-11-10 01:51:41
type: article
metadata_visibility: show
creators_name: Kumar, G.
creators_name: Basri, S.
creators_name: Imam, A.A.
creators_name: Balogun, A.O.
title: Data Harmonization for Heterogeneous Datasets in Big Data - A Conceptual Model
ispublished: pub
keywords: Computational methods; Intelligent systems; Natural language processing systems; Recurrent neural networks; Software engineering; Syntactics, Conceptual model; Data harmonization; Heterogeneous datasets; Information format; Natural language processing; Parts of speech; Training and testing; Unstructured data, Large dataset
note: cited By 4; Conference of 4th Computational Methods in Systems and Software, CoMeSySo 2020; Conference Date: 14 October 2020 Through 17 October 2020; Conference Code: 253159
abstract: Data comes from machines, transactions, and social media, and is gigantic and disparate in nature. About 80% of today's data is unstructured, while the remaining percentage is semistructured and structured. It is a big challenge for management to make efficient decisions at run time and to store heterogeneous data with existing tools. Data harmonization can be used to solve the heterogeneity problem; the idea of data harmonization is to provide a uniform representation and remove all forms of heterogeneity from heterogeneous datasets. In recent studies, various models have been developed for the integration, mapping, and fusion of structured and semistructured datasets, but no such model has been developed for structured, semistructured, and unstructured datasets together. Information extraction is a vital component for extracting data from textual datasets whose information may come in different file formats, i.e., Excel, JSON, and text. To develop a textual data harmonization model for heterogeneous datasets comprising structured, semistructured, and unstructured data, based on phrase similarity techniques, the data first needs to be preprocessed using Natural Language Processing techniques such as Bag of Phrases, Parts of Speech, and so on. This paper therefore focuses on a conceptual data harmonization model based on a text similarity technique, which will help to blend structured, semistructured, and unstructured data. The selected phrases from the heterogeneous datasets will go through training and testing using a Recurrent Neural Network. © 2020, The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG.
date: 2020
publisher: Springer Science and Business Media Deutschland GmbH
official_url: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85098178362&doi=10.1007%2f978-3-030-63322-6_61&partnerID=40&md5=b54cf25cfd96e3825e192f3b37d975b9
id_number: 10.1007/978-3-030-63322-6_61
full_text_status: none
publication: Advances in Intelligent Systems and Computing
volume: 1294
pagerange: 723-734
refereed: TRUE
isbn: 9783030633219
issn: 21945357
citation:   Kumar, G. and Basri, S. and Imam, A.A. and Balogun, A.O.  (2020) Data Harmonization for Heterogeneous Datasets in Big Data - A Conceptual Model.  Advances in Intelligent Systems and Computing, 1294.  pp. 723-734.  ISSN 21945357