Delay-Resistant Geo-Distributed Analytics
Big data analytics platforms have played a critical role in the unprecedented success of data-driven applications. However, real-time and streaming data applications, and recent legislation, e.g., GDPR in Europe, have posed constraints on exchanging and analyzing data, especially personal data, across geographic regions. To address such constraints data has to be processed and analyzed in-situ and aggregated results have to be exchanged among the different sites for further processing. This introduces additional network delays due to the geographic distribution of the sites and potentially affecting the performance of analytics platforms that are designed to operate in datacenters with low network delays. In this paper, we show that the three most popular big data analytics systems (Apache Storm, Apache Spark, and Apache Flink) fail to tolerate round-trip times more than 30 milliseconds even when the input data rate is low. The execution time of distributed big data analytics tasks degrades substantially after this threshold, and some of the systems are more sensitive than others. A closer examination and understanding of the design of these systems show that there is no winner in all wide-area settings. However, we show that it is possible to improve the performance of all these popular big data analytics systems significantly amid even transcontinental delays (where inter-node delay is more than 30 milliseconds) and achieve performance comparable to this within a datacenter for the same load.
Published in: IEEE Transactions on Network and Service Management, 10.1109/tnsm.2022.3192710, IEEE