Abstract: Big data applications are among the most suitable candidates for execution on cluster resources because of their high computational power and data storage requirements. However, correctly sizing the resources devoted to their execution does not guarantee that they will run as expected: their execution can be affected by perturbations that alter the expected execution time. Identifying when such issues occur, by comparing the actual execution time with the expected one, is essential to detect potentially critical situations and to take appropriate steps to prevent them. To fulfill this objective, accurate estimates are necessary. In this paper, machine learning techniques coupled with a posteriori knowledge are exploited to build performance estimation models. Experimental results show that the models built with the proposed approach outperform a reference state-of-the-art method (i.e., the Ernest method), reducing the error in some scenarios from 167.07-221.09% to 13.15-30.58%.