|dc.description.abstract||In the era of big data, streaming applications such as social media, surveillance monitoring and real-time search generate large volumes of data, making efficient Data Stream Processing Systems (DSPSs) essential. Designing an efficient DSPS involves a number of challenges, including task allocation, scalability, fault tolerance, quality of service (QoS), parallelism degree, and state management.
In our research, we focus on task allocation because it has a significant impact on performance metrics such as data processing latency and system throughput. An application processed by a DSPS is represented as a Directed Acyclic Graph (DAG), where each vertex represents a task and each edge represents the dataflow between two tasks. Task allocation is the assignment of the vertices of the DAG to physical compute nodes such that data movement between the nodes is minimised. Finding an optimal task placement for stream processing systems is NP-hard, so approximate scheduling approaches are required to improve the performance of DSPSs.
In this thesis, we present three schedulers, each using a different heuristic partitioning approach to minimise inter-node communication on either homogeneous or heterogeneous clusters. We demonstrate how each scheduler efficiently assigns groups of highly communicating tasks to compute nodes. Our schedulers outperform two state-of-the-art schedulers on three micro-benchmarks and two real-world applications, increasing throughput and reducing data processing latency as a result of better task placement.||