The goal of this project is to understand and apply the stages of a data analytics project (following the KDD or CRISP-DM methodology) in order to extract knowledge from data. You should also be able to evaluate the performance of the algorithms you apply appropriately, and to reflect insightfully on the results and limitations of your analyses.
In order to successfully complete the project, the following seven (7) steps should be followed.
1) Select at least three (3) related datasets appropriate to the FinTech domain. Each dataset should be suitably large (at least 10,000 rows and at least 10 columns).
Possible sources of datasets include, but are not limited to:
European Data Portal, EU Open Data Portal, and others: http://data.europa.eu/
UK’s open government data repository: http://data.gov.uk
Central Statistics Office, Ireland: http://www.cso.ie
Run My Code: http://www.runmycode.org/
Amazon’s public dataset repository: https://aws.amazon.com/datasets
Google’s Public Data Directory: http://www.google.com/publicdata/directory
The UCI machine learning repository: http://archive.ics.uci.edu/ml/
Google Data Search: https://toolbox.google.com/datasetsearch
2) Produce a Data Quality Report and perform any necessary pre-processing, e.g. transformation, imputation, feature engineering, etc. Compute summary statistics on the data and describe what these statistics say about the data.
3) Implement four (4) different visualisations on the data. You can implement these on the individual datasets, a combination of the datasets or both.
4) Describe what these visualisations say about the data.
5) Implement four (4) different data mining algorithms. State why you have chosen these algorithms, and what you have found (i.e. knowledge extracted) using them.
6) Describe how well these algorithms perform using measures of performance including (but not limited to) R², MSE, Accuracy, Precision, AUC, RMSE, F-measure and MAPE.
7) Write a report using the IEEE template. This document should not be more than eight (8) pages (including references). Papers over the page limit (even by a single word) will be subject to a 5 percentage point penalty, i.e. the maximum mark for the paper will be 95%.
The report should contain the following sections:
Introduction (which contains your objectives and motivation)
Data Quality, Data Pre-processing and Summary Statistics
Data Mining Algorithms, Results and Performance Evaluation
Conclusions & Future Work
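As an illustration of step 2 only, the following is a minimal sketch of a per-column Data Quality Report using the Python standard library. The function name data_quality_report, the list-of-dicts table layout, the use of None for missing values and the toy rows are all assumptions made for this example; they are not a required format, and a real project would normally load its chosen datasets with a data analysis library.

```python
from statistics import mean, median, stdev


def data_quality_report(rows):
    """Summarise each column of a table given as a list of dicts.

    Assumption of this sketch: a missing value is stored as None.
    """
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        entry = {
            "count": len(values),
            "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
            "cardinality": len(set(present)),  # number of distinct values
        }
        # Treat a column as numeric only if every present value is a number.
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if present and len(numeric) == len(present):
            entry["min"] = min(numeric)
            entry["max"] = max(numeric)
            entry["mean"] = mean(numeric)
            entry["median"] = median(numeric)
            entry["std"] = stdev(numeric) if len(numeric) > 1 else 0.0
        report[col] = entry
    return report


# Hypothetical toy rows, far smaller than the 10,000-row minimum in step 1.
sample = [
    {"amount": 120.0, "currency": "EUR"},
    {"amount": 80.0, "currency": "EUR"},
    {"amount": None, "currency": "USD"},
    {"amount": 100.0, "currency": None},
]

report = data_quality_report(sample)
for column, stats in report.items():
    print(column, stats)
```

A report of this kind shows at a glance which columns need imputation (high missing percentage) and which are candidates for encoding or removal (very low or very high cardinality).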
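As an illustration of step 6 only, the sketch below computes several of the listed performance measures from their standard definitions: MSE, RMSE, R² and MAPE for regression, and Accuracy, Precision and F-measure (F1) for binary classification. The function names, the choice of 1 as the positive class and the toy values are assumptions made for this example. AUC is omitted to keep the sketch short, and MAPE as written assumes no true value is zero.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, R2 and MAPE (in percent) for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "R2": 1 - ss_res / ss_tot,
        # Assumes no true value is zero.
        "MAPE": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


def classification_metrics(y_true, y_pred, positive=1):
    """Return Accuracy, Precision, Recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "Accuracy": correct / len(pairs),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


# Hypothetical toy values for illustration only.
reg = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(reg)
print(clf)
```

Regression measures (R², MSE, RMSE, MAPE) apply to algorithms that predict a numeric value, while Accuracy, Precision, AUC and F-measure apply to classifiers, so the measures reported should match the type of each algorithm chosen in step 5.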