Big data stream analysis: a systematic literature review
Any type of content formally published in an academic journal, usually following a peer-review process.
Recently, big data streams have become ubiquitous due to the fact that a number of applications generate a huge amount of data at a great velocity. This made it difficult for existing data mining tools, technologies, methods, and techniques to be applied directly on big data streams due to the inherent dynamic characteristics of big data. In this paper, a systematic review of big data streams analysis which employed a rigorous
and methodical approach to look at the trends of big data stream tools and technologies as well as methods and techniques employed in analysing big data streams. It provides a global view of big data stream tools and technologies and its comparisons. Three major databases, Scopus, ScienceDirect and EBSCO, which indexes journals and conferences that are promoted by entities such as IEEE, ACM, SpringerLink, and Elsevier were explored as data sources. Out of the initial 2295 papers that resulted from the first search string, 47 papers were found to be relevant to our research questions after implementing the inclusion and exclusion criteria. The study found that scalability, privacy and load balancing issues as well as empirical analysis of big data streams and technologies are still open for further research efforts. We also found that although, significant
research efforts have been directed to real-time analysis of big data stream not much attention has been given to the preprocessing stage of big data streams. Only a few big data streaming tools and technologies can do all of the batch, streaming, and iterative jobs; there seems to be no big data tool and technology that offers all the key features required for now and standard benchmark dataset for big data streaming analytics has not been widely adopted. In conclusion, it was recommended that research efforts should be geared towards developing scalable frameworks and algorithms that will accommodate data stream computing mode, effective resource allocation strategy and parallelization issues to cope with the ever-growing size and complexity of data.