All software and hardware used or referenced in this guide belong to their respective vendors. We developed this guide based on our own development infrastructure; it may or may not work on other systems and technical infrastructures. We are not liable for any direct or indirect problems caused to users of this guide.
The purpose of this document is to provide adequate information for users to implement a K-Means Clustering model. To achieve this, we use one of the most common financial problems that occurs in every company. The problem is solved using K-Means Clustering, an unsupervised machine learning model.
Problem Statement:
Identify intercompany transactions that have a difference in their bookings, i.e., transactions where the buyer and the seller declare different amounts for the same transaction (an incorrect transaction amount).
Business Challenges:
Business Context:
It is a legal obligation for “Company A” to disclose its financials to the internal and/or external stakeholders of the company. During this process, “Company A” collects financial data from all of its subsidiaries and then tries to identify intercompany transactions (e.g., transactions that took place between a parent company and its subsidiary, since these are not real financial transactions).
Traditionally, this process has been performed by “Company A” using various systems, several datasets, and a group of accounting experts, and it has been a week-long task.
“Company A” decided to use artificial intelligence to automate this process in order to increase productivity and reduce the time taken to complete this task.
Step 1: Defining a Clear Problem Statement
Step 2: Data Engineering - Import Training & Test Dataset
Step 3: Feature Engineering - Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
Step 4: Model Selection - Model selection is the process of choosing between different machine learning approaches
Step 5: Model Implementation
Model selection is the process of choosing between different machine learning approaches - e.g. SVM, logistic regression, K-Means etc. or choosing between different hyperparameters or sets of features for the same machine learning approach - e.g. deciding between the polynomial degrees/complexities for linear regression.
The choice of the actual machine learning algorithm (e.g. SVM or logistic regression) is less important than you'd think - there may be a "best" algorithm for a particular problem, but often its performance is not much better than other well-performing approaches for that problem.
There may be certain qualities you look for in a model:
Our problem here is an unsupervised classification (clustering) problem: identify and pair intercompany transactions that have a difference in their bookings. This type of problem can be solved by the following models.
We are going to use K-Means Clustering because the dataset does not contain any label information. This is what we do when unlabelled data needs to be assigned labels. After the labels are generated, the test dataset is used to predict the outcome we want to achieve.
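As a quick, hedged illustration of this flow (the data below is made up for illustration and is not the lab dataset), K-Means first assigns cluster labels to unlabelled training data and can then assign new test records to the learned clusters:
import pandas as vAR_pd
from sklearn.cluster import KMeans

# Illustrative, unlabelled training data: two numeric features per record
vAR_train = vAR_pd.DataFrame({'Feature_1': [1, 2, 1, 8, 9, 8],
                              'Feature_2': [100, 110, 105, 900, 950, 920]})
vAR_kmeans = KMeans(n_clusters=2, random_state=0)   # choose the number of clusters
vAR_kmeans.fit(vAR_train)                           # labels are generated by the model, not given
print(vAR_kmeans.labels_)                           # cluster label for each training row

# New (test) records are assigned to the nearest learned cluster
vAR_test = vAR_pd.DataFrame({'Feature_1': [2, 9], 'Feature_2': [108, 930]})
print(vAR_kmeans.predict(vAR_test))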
The input (training) data set is the actual dataset used to train the model to perform various machine learning operations (regression, classification, clustering, etc.). This is the data from which the model learns, using various APIs and algorithms.
The following section describes the training data set and its field-level characteristics.
The test data set helps you validate that training has been effective in terms of accuracy, precision, and so on. It is used to test whether the model responds and works appropriately.
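As a minimal sketch of this step (the file names below are placeholders; the lab code reads its actual paths from the vAR_Fetched_Data_* configuration values), the training and test files can be loaded into pandas DataFrames for inspection:
import pandas as vAR_pd

# Placeholder file names, assumed to be Excel files as in the lab code
vAR_Training_Path = 'Intercompany_Training_Data.xlsx'
vAR_Test_Path = 'Intercompany_Test_Data.xlsx'

vAR_df_train = vAR_pd.read_excel(vAR_Training_Path)   # data the model learns from
vAR_df_test = vAR_pd.read_excel(vAR_Test_Path)        # data used to validate the trained model
print(vAR_df_train.head())                             # inspect the field-level characteristics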
The following section describes the features that are used in the model.
There are several technical and functional components involved in implementing this model. Here are the key building blocks to implement the model.
Implementing an AI model to address a given problem involves several steps. Here are the key steps involved in implementing this model. You can customize these steps as needed; they were developed for participant learning purposes only.
Jupyter Notebook is launched through the command prompt. Type cmd in the search box to open the Command Prompt terminal.
Now type jupyter notebook and press Enter as shown.
After typing, the page below opens.
To open a new file, follow the instructions below.
Go to New >>> Python [conda root]
Give a meaningful name to the file as shown below.
For our model implementation we need the following libraries:
Sklearn: scikit-learn is the machine learning library we use; it is built on libraries such as NumPy and SciPy, which are used for numerical and scientific computation.
Pandas: pandas is a library used for data manipulation and analysis. For our implementation we use it to import the data files and create DataFrames (which store the data).
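Putting these together, a typical import cell for this implementation looks like the following (the vAR_ aliases follow the naming convention used in the rest of this guide):
import pandas as vAR_pd                            # data import and DataFrame handling
from sklearn.cluster import KMeans                 # K-Means Clustering model
from sklearn.preprocessing import LabelEncoder     # categorical-to-numerical conversion
import matplotlib.pyplot as vAR_plt                # plotting the model results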
Step 3 of the implementation is feature selection. Machine learning works on a simple rule: if you put garbage in, you will only get garbage out. By garbage here we mean noise in the data. This becomes even more important when the number of features is very large. We need only those features (inputs) that are a function of the labels (outputs). For example, to predict whether a given fruit is an apple or an orange, the color/texture of the fruit becomes a feature to consider: if the color is red, it is an apple; if it is orange, it is an orange.
The selected features must be numerical. If not, they have to be converted from categorical values to numerical values. In our scenario we use a LabelEncoder for the conversion.
The selected features are Company, Document Date, Trading Partner, and Transaction Type. The label (target variable) is the cluster predicted by the model.
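The sketch below shows the idea of this conversion; the DataFrame and its values are illustrative only, while the real features are read from the training file:
import pandas as vAR_pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows only - the real data comes from the training file
vAR_df = vAR_pd.DataFrame({'Company': ['1000', '2000', '1000'],
                           'Document Date': ['2018-01-05', '2018-01-05', '2018-02-10'],
                           'Trading Partner': ['2000', '1000', '3000'],
                           'Transaction Type': ['Sale', 'Purchase', 'Sale']})

vAR_encoder = LabelEncoder()
for vAR_col in ['Company', 'Document Date', 'Trading Partner', 'Transaction Type']:
    vAR_df[vAR_col] = vAR_encoder.fit_transform(vAR_df[vAR_col])   # categories -> integer codes
print(vAR_df)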
In this lab work, we used K-Means Clustering, an unsupervised learning model, to identify intercompany transactions that have a difference in their bookings. The model performed well on the test data and predicted the outcome as expected. For further data analysis and business decisions, the model outcome is stored on persistent storage (a file), for example as follows.
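A minimal sketch of that final step, assuming the predicted cluster labels have already been merged back onto the test transactions (the DataFrame contents and file name below are placeholders):
import pandas as vAR_pd

# Placeholder result frame - in the lab this is the test data merged with the predicted labels
vAR_df_result = vAR_pd.DataFrame({'Company': ['1000', '2000'],
                                  'Predicted_Inter_Transaction_Pair_Difference_Booking': [0, 1]})
vAR_df_result.to_csv('Model_Outcome.csv', index=False)   # persist the outcome for further analysis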
This is a very basic implementation intended to help you learn and better understand the overall steps and processes involved in implementing an unsupervised machine learning model. There are many more steps, processes, data sets, and technologies involved in practice. We strongly recommend that you learn more and prepare yourself to address real-world problems.
Fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. A model that is well fitted produces more accurate outcomes, a model that is overfitted matches the training data too closely, and a model that is underfitted does not match it closely enough. Fitting is the essence of machine learning: if your model does not fit your data correctly, the outcomes it produces will not be accurate enough to be useful for practical decision-making.
The model is best fitting when it performs well on the training examples and also performs well on unseen data. Ideally, a model that makes predictions with zero error is said to have a best fit on the data. This situation is achievable at a spot between overfitting and underfitting. To understand it, we have to look at the performance of our model over time as it learns from the training dataset.
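The verification cells below assume that earlier steps of the notebook have already created vAR_pd (the pandas alias), vAR_df4 (the encoded training features), vAR_model (the fitted K-Means model), and vAR_Labels_Pred (the predicted clusters for the test data, held as a one-column DataFrame). A condensed sketch of that setup, with placeholder file names, is:
import pandas as vAR_pd
from sklearn.cluster import KMeans

# Placeholder file names - the lab reads the real paths from its vAR_Fetched_Data_* values
vAR_df4 = vAR_pd.read_excel('Training_Features.xlsx')          # encoded training features
vAR_Features_Test = vAR_pd.read_excel('Test_Features.xlsx')    # encoded test features

vAR_model = KMeans(n_clusters=7, random_state=0)               # cluster count as used in the later cells
vAR_model.fit(vAR_df4)                                         # learn the clusters from the training data

# Predicted cluster for each test transaction, as a one-column DataFrame for index-based merging
vAR_Labels_Pred = vAR_pd.DataFrame(
    vAR_model.predict(vAR_Features_Test).astype(int),
    columns=['Predicted_Inter_Transaction_Pair_Difference_Booking'])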
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Bestfit():
        import matplotlib.pyplot as vAR_plt
        vAR_model.labels_    # cluster labels learned from the training data
        # Read the example file and attach the predicted cluster labels by row index
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Best_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot Company against the predicted cluster (single valid color; 'rgb' is not a valid color spec)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='r')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Best_Fit_Image_Example_1)
    print(Verify_Model_Bestfit())
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Bestfit():
        import matplotlib.pyplot as vAR_plt
        vAR_model.labels_
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Best_Fit_File_Example_2, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='b')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Best_Fit_Image_Example_2)
    print(Verify_Model_Bestfit())
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Bestfit():
        import matplotlib.pyplot as vAR_plt
        vAR_model.labels_
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Best_Fit_File_Example_3, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='g')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Best_Fit_Image_Example_3)
    print(Verify_Model_Bestfit())
The model is overfitting when it performs well on training examples but does not perform well on unseen data. Overfitting is often the result of an excessively complex model. It happens because the model memorizes the relationship between the input examples (often called X) and the target variable (often called y) and so is unable to generalize to new data. An overfitted model predicts the targets in the training data set very accurately.
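For K-Means specifically, one hedged way to see overfitting is to push the number of clusters toward the number of training points, so the model effectively memorizes the training data instead of learning general structure (the data below is made up for illustration):
import numpy as np
from sklearn.cluster import KMeans

vAR_X = np.random.RandomState(0).rand(20, 2)          # 20 illustrative training points

# n_clusters close to the number of points: almost every point gets its own cluster,
# so the training data is "memorized" rather than generalized
vAR_overfit_model = KMeans(n_clusters=18, random_state=0).fit(vAR_X)
print(len(set(vAR_overfit_model.labels_)))             # nearly one cluster per training point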
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Overfit():
        import matplotlib.pyplot as vAR_plt
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Over_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='r')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Over_Fit_Image_Example_1)
    print(Verify_Model_Overfit())
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Overfit():
        import matplotlib.pyplot as vAR_plt
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Over_Fit_File_Example_2, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='c')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Over_Fit_Image_Example_2)
    print(Verify_Model_Overfit())
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Overfit():
        import matplotlib.pyplot as vAR_plt
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Over_Fit_File_Example_3, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='b')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Over_Fit_Image_Example_3)
    print(Verify_Model_Overfit())
The predictive model is said to be underfitting if it performs poorly even on the training data. This happens because the model is unable to capture the relationship between the input examples and the target variable, often because the model is too simple, i.e., the input features are not expressive enough to describe the target variable well. An underfitted model does not predict the targets in the training data set very accurately. Underfitting can be reduced by using more expressive features or a more flexible model.
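Conversely, for K-Means a hedged illustration of underfitting is to use far too few clusters, so the model is too simple to capture the structure in the data (again with made-up data):
import numpy as np
from sklearn.cluster import KMeans

vAR_X = np.random.RandomState(0).rand(20, 2)           # 20 illustrative training points

# A single cluster cannot capture any grouping structure in the data
vAR_underfit_model = KMeans(n_clusters=1, random_state=0).fit(vAR_X)
print(vAR_underfit_model.inertia_)                      # high within-cluster error even on training data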
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Underfit():
        import matplotlib.pyplot as vAR_plt
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='r')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Under_Fit_Image_Example_1)
    print(Verify_Model_Underfit())
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Underfit():
        import matplotlib.pyplot as vAR_plt
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_2, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='b')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Under_Fit_Image_Example_2)
    print(Verify_Model_Underfit())
if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Underfit():
        import matplotlib.pyplot as vAR_plt
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_3, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        vAR_plt.scatter(vAR_df9.iloc[:, 0], vAR_df9.iloc[:, 13], s=100, c='g')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Under_Fit_Image_Example_3)
    print(Verify_Model_Underfit())
Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it using the complementary subset of the dataset.
if vAR_Fetched_Data_Cross_Validation_Required == 'Y':
    #from sklearn import datasets
    from sklearn.model_selection import cross_val_predict
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as vAR_plt
    vAR_model = KMeans(n_clusters=7, random_state=0)
    vAR_model.fit(vAR_df4)
    vAR_model.labels_
    # Cross-validated cluster predictions on the training features (2 folds)
    vAR_Predicted = cross_val_predict(vAR_model, vAR_df4, cv=2)
    vAR_fig, vAR_ax = vAR_plt.subplots()
    vAR_Labels_Pred = vAR_model.labels_
    vAR_ax.scatter(vAR_Labels_Pred, vAR_Predicted, edgecolors=(0, 0, 0))
    vAR_ax.plot([vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], [vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], 'k--', lw=4)
    vAR_ax.set_xlabel('Actual Difference in Booking')
    vAR_ax.set_ylabel('Predicted Difference in Booking')
    #vAR_plt.show()
    vAR_plt.savefig(vAR_Fetched_Data_Cross_Validation_Image_Example_1)
if vAR_Fetched_Data_Cross_Validation_Required == 'Y':
    #from sklearn import datasets
    from sklearn.model_selection import cross_val_predict
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as vAR_plt
    vAR_model = KMeans(n_clusters=7, random_state=0)
    vAR_model.fit(vAR_df4)
    vAR_model.labels_
    # Cross-validated cluster predictions on the training features (4 folds)
    vAR_Predicted = cross_val_predict(vAR_model, vAR_df4, cv=4)
    vAR_fig, vAR_ax = vAR_plt.subplots()
    vAR_Labels_Pred = vAR_model.labels_
    vAR_ax.scatter(vAR_Labels_Pred[:15], vAR_Predicted[:15], edgecolors=(0, 0, 0))
    vAR_ax.plot([vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], [vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], 'k--', lw=4)
    vAR_ax.set_xlabel('Actual Difference in Booking')
    vAR_ax.set_ylabel('Predicted Difference in Booking')
    #vAR_plt.show()
    vAR_plt.savefig(vAR_Fetched_Data_Cross_Validation_Image_Example_2)
if vAR_Fetched_Data_Cross_Validation_Required == 'Y':
    #from sklearn import datasets
    from sklearn.model_selection import cross_val_predict
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as vAR_plt
    vAR_model = KMeans(n_clusters=7, random_state=0)
    vAR_model.fit(vAR_df4)
    vAR_model.labels_
    # Cross-validated cluster predictions on the training features (4 folds)
    vAR_Predicted = cross_val_predict(vAR_model, vAR_df4, cv=4)
    vAR_fig, vAR_ax = vAR_plt.subplots()
    vAR_Labels_Pred = vAR_model.labels_
    vAR_ax.scatter(vAR_Labels_Pred[:10], vAR_Predicted[:10], edgecolors=(0, 0, 0))
    vAR_ax.plot([vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], [vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], 'k--', lw=4)
    vAR_ax.set_xlabel('Actual Difference in Booking')
    vAR_ax.set_ylabel('Predicted Difference in Booking')
    #vAR_plt.show()
    vAR_plt.savefig(vAR_Fetched_Data_Cross_Validation_Image_Example_3)
Hyperparameter Optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.
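For K-Means, the main hyperparameter is n_clusters. The cells below simply compare two hand-picked values (6 and 12). A more systematic alternative, not part of the lab code, is to sweep n_clusters and compare the within-cluster sum of squares (the "elbow" method), as in this sketch with stand-in data:
import numpy as np
from sklearn.cluster import KMeans

vAR_X = np.random.RandomState(0).rand(100, 4)           # stand-in for the encoded training features

for vAR_k in range(2, 13):
    vAR_km = KMeans(n_clusters=vAR_k, random_state=0).fit(vAR_X)
    # inertia_ is the within-cluster sum of squares; look for the "elbow" where it stops improving much
    print(vAR_k, round(vAR_km.inertia_, 2))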
if vAR_Fetched_Data_Hyperparameter_Tuning_Required == 'Y':
    def Before_Hyperparameter_Tuning():
        from sklearn.cluster import KMeans
        vAR_model = KMeans(n_clusters=6, random_state=0)   # initial (untuned) number of clusters
        vAR_model.fit(vAR_df4)
        vAR_df6 = vAR_pd.read_excel(vAR_Fetched_Data_Source_Path_Test_Data)
        #plt.scatter(vAR_df.iloc[:,0],vAR_df.iloc[:,12])
        vAR_Features_Test = vAR_pd.read_excel(vAR_Fetched_Data_Test_All_Features)
        vAR_Labels_Pred = vAR_model.predict(vAR_Features_Test).astype(int)
        vAR_Labels_Pred = vAR_pd.DataFrame(vAR_Labels_Pred, columns=['Predicted_Inter_Transaction_Pair_Difference_Booking'])
        vAR_df7 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df8 = vAR_df7.iloc[:, :-1]
        vAR_df10 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        import matplotlib.pyplot as vAR_plt
        vAR_plt.scatter(vAR_df10.iloc[:, 0], vAR_df10.iloc[:, 13], s=100, c='r')
        vAR_plt.xlabel('Company Code')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        #vAR_plt.savefig(vAR_Fetched_Data_Before_Hyperparameter_Tuning_Image)
    print(Before_Hyperparameter_Tuning())
if vAR_Fetched_Data_Hyperparameter_Tuning_Required == 'Y':
    def After_Hyperparameter_Tuning():
        from sklearn.cluster import KMeans
        vAR_model = KMeans(n_clusters=12, random_state=0)   # tuned number of clusters
        vAR_model.fit(vAR_df4)
        vAR_df6 = vAR_pd.read_excel(vAR_Fetched_Data_Source_Path_Test_Data)
        #plt.scatter(vAR_df.iloc[:,0],vAR_df.iloc[:,12])
        vAR_Features_Test = vAR_pd.read_excel(vAR_Fetched_Data_Test_All_Features)
        vAR_Labels_Pred = vAR_model.predict(vAR_Features_Test).astype(int)
        vAR_Labels_Pred = vAR_pd.DataFrame(vAR_Labels_Pred, columns=['Predicted_Inter_Transaction_Pair_Difference_Booking'])
        vAR_df7 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df8 = vAR_df7.iloc[:, :-1]
        vAR_df10 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        import matplotlib.pyplot as vAR_plt
        vAR_plt.scatter(vAR_df10.iloc[:, 0], vAR_df10.iloc[:, 13], s=100, c='r')
        vAR_plt.xlabel('Company Code')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        #vAR_plt.savefig(vAR_Fetched_Data_Before_Hyperparameter_Tuning_Image)
    print(After_Hyperparameter_Tuning())
Our team comprises MIT learning facilitators, Harvard PhDs, Stanford alumni, leading management consulting experts, industry leaders, and proven entrepreneurs. Collectively, our team brings business and technology together with risk-free implementation of artificial intelligence for the enterprise.
Jothi Periasamy
Chief AI Architect
2100 Geng Road
Suite 210
Palo Alto
CA 94303
(916)-296-0228