Abstract:
Data mining, the process of knowledge discovery, attempts to discover useful information or patterns in large data repositories such as databases. Data experts are therefore interested in how data can be collected, stored, accessed and combined for analysis in order to extract knowledge that is useful to the public, including financial institutions and other sectors.
The Business Development Fund (BDF) aims to promote the development of small and medium-sized enterprises (SMEs) by providing financial services that enhance the lending mechanisms of financial institutions. As part of the financial infrastructure for promoting SMEs, it was established to help SMEs access finance with ease, particularly those without sufficient collateral to obtain credit from traditional financial institutions at reasonable rates. The BDF conducts a number of activities, including guaranteeing loans for SMEs and providing financial education services to SMEs in Rwanda. The SME sector, including formal and informal businesses, comprises 98% of the businesses in Rwanda and accounts for 41% of all private sector employment (Minicom, 2010; OECD, 2011).
In recent years, machine learning has become a popular field in big data analytics because of its success in learning complicated models. Methods such as decision trees, support vector machines, logistic regression and artificial neural networks can recognize, with a high degree of accuracy, patterns in the data that may not be apparent to human analysts. This is why applications of data science using machine learning are important in such an organisation and in all financial institutions.
Due to the advanced technology associated with big data, data availability and computing power, most financial or lending institutions are renewing their business models. Loan prediction, monitoring, model reliability and effective loan processing are key to decision-making and transparency. In this research, we visualize data and build binary classifiers based on machine learning and deep learning models, using real data, to predict loan default probability. The important features from these models are selected and then used in the modeling process to test the stability of the classifiers by comparing their performance on separate data. After analysing and visualizing the data, we use decision tree, random forest, logistic regression and artificial neural network models and compare them to determine which are good predictors in this case.
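
As an illustration of the comparison described above, the following is a minimal sketch only, not the pipeline used in this research. It assumes synthetic data with two hypothetical standardized features (`income`, `debt_ratio`) and uses only the Python standard library. A depth-1 decision tree (a "stump") stands in for the full decision tree, and the random forest and neural network are omitted for brevity. All function names are illustrative assumptions. Each classifier is fitted on a training split and scored on separate held-out data.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_loans(n, rng):
    # Hypothetical data: two standardized features and a 0/1 default label.
    rows = []
    for _ in range(n):
        income = rng.gauss(0.0, 1.0)
        debt_ratio = rng.gauss(0.0, 1.0)
        p_default = sigmoid(-1.5 * income + 2.0 * debt_ratio)
        rows.append(([income, debt_ratio], 1 if rng.random() < p_default else 0))
    return rows


def fit_logistic(train, lr=0.5, epochs=300):
    # Batch gradient descent on the mean log-loss; returns a predict function.
    w, b = [0.0, 0.0], 0.0
    n = len(train)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in train:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return lambda x: 1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5 else 0


def fit_stump(train):
    # Depth-1 decision tree: choose the feature, threshold and side
    # with the best training accuracy.
    best = (-1.0, 0, 0.0, 1)
    for j in range(2):
        for t in sorted(x[j] for x, _ in train)[::10]:
            for above in (0, 1):
                hits = sum(
                    1 for x, y in train
                    if (above if x[j] > t else 1 - above) == y
                )
                best = max(best, (hits / len(train), j, t, above))
    _, j, t, above = best
    return lambda x: above if x[j] > t else 1 - above


def accuracy(predict, rows):
    return sum(1 for x, y in rows if predict(x) == y) / len(rows)


def compare_models(seed=0, n=1000, test_fraction=0.3):
    # Fit each model on the training split and score it on held-out data.
    rng = random.Random(seed)
    rows = make_synthetic_loans(n, rng)
    split = int(n * (1 - test_fraction))
    train, test = rows[:split], rows[split:]
    models = {
        "logistic_regression": fit_logistic(train),
        "decision_stump": fit_stump(train),
    }
    return {name: accuracy(model, test) for name, model in models.items()}


if __name__ == "__main__":
    for name, acc in compare_models().items():
        print(f"{name}: held-out accuracy = {acc:.3f}")
```

Scoring on a split that was not used for fitting is what makes the comparison meaningful. In practice, accuracy would likely be complemented by metrics better suited to imbalanced default data, such as AUC or recall on the default class.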