Unsupervised Machine Learning: K-Means Clustering Implementation Guide



Disclaimer

All software and hardware used or referenced in this guide belong to their respective vendors. We developed this guide based on our development infrastructure, and it may or may not work on other systems and technical infrastructure. We are not liable for any direct or indirect problems caused to users of this guide.

Executive Summary

The purpose of this document is to provide adequate information for users to implement a K-Means Clustering model. To achieve this, we use one of the most common financial problems that occurs in every company. The problem is solved using K-Means Clustering, an unsupervised machine learning model.

Business Problem

Problem Statement:

Identify intercompany transactions that have differences in their bookings - transactions where the buyer and the seller declare different amounts for the same transaction (an incorrect transaction amount).

Business Challenges:

  • Time-consuming
  • Different data sources, data formats and data frequency
  • Human-intensive

Business Context:

It is a legal obligation for “Company A” to disclose its financials to the internal and/or external stakeholders of the company. During this process “Company A” collects financial data from all of its subsidiaries and then tries to identify intercompany transactions (e.g., transactions that took place between a parent company and its subsidiary, since these are not real financial transactions).

Traditionally, this process has been performed by “Company A” using various systems, several datasets and a group of accounting experts, and it has been a week-long task.

“Company A” decided to use artificial intelligence to automate this process, in order to increase productivity and reduce the time taken to complete this task.

High Level Implementation Steps

Step 1: Defining a Clear Problem Statement

Step 2: Data Engineering - Import Training & Test Dataset

Step 3: Feature Engineering - Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Step 4: Model Selection - Model selection is the process of choosing between different machine learning approaches.

Step 5: Model Implementation

  • Import the Required Libraries
  • Import the Input Data
  • Extract the Features
  • Convert Categorical Values to Numerical Values using Label Encoder
  • Train the Model
  • Review the Learning Algorithm
  • Import Test Data
  • Run the model on Test Data
  • Review the model outcome & write the model outcome to a file. Open the file to review the outcome.

Model Selection

Model selection is the process of choosing between different machine learning approaches - e.g. SVM, logistic regression, K-Means etc. or choosing between different hyperparameters or sets of features for the same machine learning approach - e.g. deciding between the polynomial degrees/complexities for linear regression.

The choice of the actual machine learning algorithm (e.g. SVM or logistic regression) is less important than you'd think - there may be a "best" algorithm for a particular problem, but often its performance is not much better than other well-performing approaches for that problem.

There may be certain qualities you look for in a model:

  • Interpretable - can we see or understand why the model is making the decisions it makes?
  • Simple - easy to explain and understand
  • Accurate
  • Fast (to train and test)
  • Scalable (it can be applied to a large dataset)

Our problem here is an unsupervised clustering problem: identify and pair intercompany transactions that have differences in their bookings. This type of problem can be solved by the following models.

  • K-Means Clustering
  • Hierarchical Clustering

We are going to use K-Means Clustering because the dataset does not contain any label information. This is what we do when there is unlabeled data for which labels need to be generated. After the labels are generated, the test dataset is used to predict what we want to achieve.
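
To make the flow concrete, here is a minimal sketch of how K-Means generates labels for unlabeled data and then assigns new records to those clusters. The data here is synthetic (generated with scikit-learn's make_blobs) and the variable names are illustrative only; the actual implementation for our problem follows in the later sections.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data purely for illustration
vAR_X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Fit K-Means to generate cluster labels for the unlabeled data
vAR_km = KMeans(n_clusters=3, random_state=0)
vAR_km.fit(vAR_X)
print(vAR_km.labels_[:10])        # labels generated for the training records

# Assign new (test) records to the learned clusters
vAR_X_new, _ = make_blobs(n_samples=5, centers=3, random_state=1)
print(vAR_km.predict(vAR_X_new))  # predicted cluster for each new record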

Feature Engineering

  • Feature engineering is the process of using industry knowledge of the data to create features that make machine learning algorithms work. If feature engineering is done correctly, it increases the predictive power of machine learning algorithms by creating features from raw data that help facilitate the machine learning process. Feature engineering is an art.
  • Feature engineering is the most important step in machine learning; it makes the biggest difference between a good model and a bad model.

Advantages of Feature Engineering

  • Good features provide you with the flexibility of choosing an algorithm; even if you choose a less complex model, you get good accuracy.
  • If you choose good features, then even simple ML algorithms do well.
  • Better features lead to better accuracy. You should spend more time on feature engineering to generate the appropriate features for your dataset; if you derive the best and most appropriate features, you have won most of the battle (a short sketch follows this list).
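
As a small, hypothetical sketch of what this looks like in practice (the column names and values below are illustrative only, not the actual training file), deriving a date part and an amount magnitude from raw fields:

import pandas as vAR_pd

# Illustrative raw records (hypothetical values)
vAR_raw = vAR_pd.DataFrame({'Posting Date': ['2018-01-15', '2018-02-20', '2018-02-28'],
                            'Transaction Amount': [1000.0, -980.0, 550.0]})

# Engineered features: a date part and the absolute transaction amount
vAR_raw['Posting_Month'] = vAR_pd.to_datetime(vAR_raw['Posting Date']).dt.month
vAR_raw['Transaction_Amount_Abs'] = vAR_raw['Transaction Amount'].abs()
print(vAR_raw)
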
Data Management

There are three types of datasets used at various stages of this implementation: training, test, and development. The training dataset is the largest of the three, while the test dataset acts as a seal of approval and does not need to be used until the end of development.

What is an Input Data Set?

The input data set is the actual dataset used to train the model for performing various machine learning operations (regression, classification, clustering, etc.). This is the data from which the model learns, using various APIs and algorithms, so that the machine can work automatically.

Input Data Set

The following section describes the training data set and its field-level characteristics.

  • Company
  • Company Code Country
  • Posting Date
  • Document Date
  • Currency
  • Trading Partner
  • Trading Partner Country
  • Transaction Type
  • Data Source
  • Data Category
  • Account
  • Transaction Amount
  • Inter Company Transaction Flag
What is a Test Data Set?

The test data set helps you validate that the training has happened efficiently in terms of accuracy, precision, and so on. Such data is used to test whether the model is responding or working appropriately.

Test Data Set

The following section describes the features used in the model.

  • Company Code
  • Trading Partner
  • Trading Partner Country
  • Transaction Type
  • Data Category
  • Transaction Amount
What is a Learning Algorithm?

  • A self-learning code (not human-developed logic) that performs data analysis and extracts patterns (business characteristics) in data for business application development - a modern approach to application/software development.
  • Automatically understands and extracts data patterns when business circumstances change, and performs data analysis based on the new/changed data - no code change is required to implement changes that took place in the data (changes in the business).

Machine Learning Libraries Used

  • Sklearn 0.19.0 (Scikit Learn)
  • Pandas 0.20.3

Clustering Model Used

  • K-Means Clustering

Model Building Blocks

There are several technical and functional components involved in implementing this model. Here are the key building blocks to implement the model.

Model Building Blocks

Model Implementation Steps

AI model implementation to address a given problem involves several steps. Here are the key steps involved in implementing the model. You can customize these steps as needed; they were developed for participant learning purposes only.

Model Implementation Steps

Model Implementation Code Block

  • # Step 1 - Import the Required Libraries
    import pandas as vAR_pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.cluster import KMeans
  • # Step 2 - Import the Clustering Data
    vAR_df = vAR_pd.read_excel(vAR_Fetched_Data_Source_Path_Train_Data)
  • # Step 3 - Convert Categorical Data into Numerical Values using Label Encoder
    vAR_le = LabelEncoder()
    vAR_Transaction_Type_Conversion = vAR_le.fit_transform(vAR_df.iloc[:,7])
    vAR_Transaction_Type_Conversion_df = vAR_pd.DataFrame(vAR_Transaction_Type_Conversion,columns=['Transaction_Type_Converted'])
    vAR_Data_Category_Conversion = vAR_le.fit_transform(vAR_df.iloc[:,9])
    vAR_Data_Category_Conversion_df = vAR_pd.DataFrame(vAR_Data_Category_Conversion,columns=['Data_Category_Converted'])
    vAR_Doc_Date_Conversion = vAR_le.fit_transform(vAR_df.iloc[:,3])
    vAR_Doc_Date_Conversion_df = vAR_pd.DataFrame(vAR_Doc_Date_Conversion,columns=['Doc_Date_Converted'])
    # Attach the converted numerical data to the main dataframe
    vAR_df1 = vAR_df.merge(vAR_Transaction_Type_Conversion_df,left_index=True, right_index=True)
    vAR_df2 = vAR_df1.merge(vAR_Data_Category_Conversion_df,left_index=True, right_index=True)
    vAR_df3 = vAR_df2.merge(vAR_Doc_Date_Conversion_df,left_index=True, right_index=True)
    # Read the prepared numeric feature set used for training
    vAR_df4 = vAR_pd.read_excel(vAR_Fetched_Data_Train_All_Features)
  • # Step 4 - Clustering the Data
    vAR_model = KMeans(n_clusters=7,random_state=0)
    vAR_model.fit(vAR_df4)
    vAR_Class = vAR_model.labels_
    vAR_Class_df = vAR_pd.DataFrame(vAR_Class,columns=['New_Group'])
    vAR_Class_df
    vAR_df4_Class_Merge = vAR_df4.merge(vAR_Class_df,left_index=True, right_index=True)
    vAR_df4_Class_Merge
    vAR_df5 = vAR_df4_Class_Merge.iloc[:,4]
  • # Step 5 - Review the Learning Algorithm
    vAR_model.predict(vAR_df4)
  • # Step 6 - Import Test Data
    vAR_df6 = vAR_pd.read_excel(vAR_Fetched_Data_Source_Path_Test_Data)
    vAR_df7 = vAR_pd.read_excel(vAR_Fetched_Data_Test_All_Features)
    vAR_Features_test = vAR_df7
    vAR_model1 = KMeans(n_clusters=4,random_state=0)
    vAR_model1.fit(vAR_Features_test)
  • # Step 7 - Running the Model on Test Data
    vAR_Labels_Pred = vAR_model1.predict(vAR_Features_test)
  • # Step 8 - Review the Model Outcome
    vAR_Labels_Pred = vAR_pd.DataFrame(vAR_Labels_Pred,columns=['Predicted_Inter_Transaction_Pair'])
  • # Step 9 - Write the Model Outcome to File
    vAR_df8 = vAR_pd.read_excel(vAR_Fetched_Data_Source_Path_Train_Data)
    vAR_df9 = vAR_df8.merge(vAR_Labels_Pred,left_index=True, right_index=True)
    vAR_df9.to_excel(vAR_Fetched_Data_Model_Path)
  • # Step 10 - Open and View the File Outcome
    vAR_df11 = vAR_pd.read_excel(vAR_Fetched_Data_Model_Path)
    vAR_df11.head()
Model Implementation Steps

Step 0: Open Jupyter Notebook

Jupyter Notebook is launched through the command prompt. Type cmd in the search box to open the Command Prompt terminal.

Model Implementation Steps

Now, type jupyter notebook and press Enter as shown.

Model Implementation Steps

After typing the command, the page below opens.

Model Implementation Steps

Open a New File or New Program in Jupyter Notebook

To open a new file, follow the instructions below.

Go to New >>> Python [conda root]

Model Implementation Steps

Give a meaningful name to the File as shown below.

Model Implementation Steps

Model Implementation Steps
Step 1 - Import the Required Libraries

For our model implementation we need the following libraries:

Sklearn (scikit-learn): the machine learning library; it builds on libraries such as NumPy and SciPy, which are used for numerical and scientific computations.

Pandas: a library used for data manipulation and analysis. For our implementation we use it to import the data file and create DataFrames (which store the data).

Import Required Libraries Used

Step 2 - Import the Clustering Data

The next immediate step after importing all libraries is importing the training data. We import the clustering data stored on our local system using the Pandas library.

Import the Clustering Data

Step 3 - Feature & Label Selection

Step 3 of the implementation is feature selection. Machine learning works on a simple rule - if you put garbage in, you will only get garbage out. By garbage here, we mean noise in the data. This becomes even more important when the number of features is very large. We need only those features (inputs) that are a function of the labels (outputs). For example, to predict whether a given fruit is an apple or an orange, the color/texture of the fruit becomes a feature to consider: if the color is red it is an apple; if it is orange, it is an orange.

The features selected must be numerical. If not, they have to be converted from categorical values to numerical values. In our scenario we use LabelEncoder for the conversion.

The features selected are Company, Document Date, Trading Partner, and Transaction Type. The label (target variable) is the cluster predicted by the model.
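
As a minimal sketch of how LabelEncoder performs this conversion (the category values below are illustrative only):

from sklearn.preprocessing import LabelEncoder

vAR_le = LabelEncoder()
vAR_codes = vAR_le.fit_transform(['Sale', 'Purchase', 'Sale', 'Transfer'])
print(vAR_codes)                             # [1 0 1 2] - one integer per category
print(vAR_le.classes_)                       # categories in encoded order
print(vAR_le.inverse_transform(vAR_codes))   # recovers the original categories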

Feature & Label Selection

Step 4 - Clustering the Data

Step 4 is clustering the data, i.e., making the model group/cluster the data and recognize the patterns in it.

vAR_model.fit(vAR_df4) fits the model for clustering.
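
The number of clusters (n_clusters=7 here) is a modeling choice. One common sanity check is the elbow method: fit K-Means for a range of k values and compare the inertia (within-cluster sum of squared distances). A short sketch, assuming vAR_df4 is the numeric feature DataFrame prepared in Step 3:

from sklearn.cluster import KMeans

# Inertia for a range of cluster counts; look for the "elbow" where it stops dropping sharply
for vAR_k in range(2, 11):
    vAR_km = KMeans(n_clusters=vAR_k, random_state=0).fit(vAR_df4)
    print(vAR_k, vAR_km.inertia_)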

Clustering the Data

Step 5 - Review the Learning Algorithm

As a next step we need to review the algorithm to see how it has learned from the features we provided, as shown.
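
One practical way to review what the algorithm has learned is to inspect the cluster centers and the size of each cluster. A small sketch, again assuming the fitted vAR_model and the feature DataFrame vAR_df4 from the previous steps:

import pandas as vAR_pd

# Cluster centers in feature space (one row per cluster)
print(vAR_pd.DataFrame(vAR_model.cluster_centers_, columns=vAR_df4.columns))

# Number of training records assigned to each cluster
print(vAR_pd.Series(vAR_model.labels_).value_counts())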

Review the Learning Algorithm

Step 6 - Import Test Data

Import the test data; this is the data used to test how the model performs.

Import Test Data

Step 7 - Running the Model on Test Data

Next we test the model with the test data as shown.

Running the Model on Test Data

Step 8 - Review the Model Outcome

Next we check the output of the model, i.e., the predictions it has made on the test data.

Review the Model Outcome

Step 9 - Write the Model Outcome to a File

Next we write the model output to an Excel file for analysis.

Write the Model to a File

Step 10 - Open the File to View the Outcome

Open the written file and check the outcome as shown. Execute to view the data.

Open the File to View the Outcome


Conclusion

In this lab work, we have used K-Means Clustering, an unsupervised learning model, to identify intercompany transactions that have differences in their bookings. The model performed well on the test data and predicted the outcome as expected. For further data analysis and business decisions, the model outcome is stored on persistent storage (a file).

This is a very basic implementation to learn and better understand the overall steps and processes involved in implementing an unsupervised machine learning model. There are many more steps, processes, data and technologies involved. We strongly recommend that you learn more and prepare yourself to address real-world problems.

Appendix

Model Fitting in Machine Learning

Fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. A model that is well fitted produces more accurate outcomes; a model that is overfitted matches the data too closely; and a model that is underfitted does not match closely enough. Fitting is the essence of machine learning. If your model does not fit your data correctly, the outcomes it produces will not be accurate enough to be useful for practical decision-making.

Types of Fitting

  • Best Fitting
  • Over Fitting
  • Under Fitting
Best Fitting

The model is best fitting when it performs well on the training examples and also performs well on unseen data. Ideally, the case where the model makes predictions with zero error is said to be a best fit on the data. This situation is achievable at a spot between overfitting and underfitting. To understand it, we have to look at the performance of our model over time while it is learning from the training dataset.
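
For a clustering model there is no labelled error to measure, but the silhouette score gives a comparable signal: a well-fitted model keeps a reasonably high silhouette on both the training data and unseen data. A hedged sketch, assuming vAR_df4 (training features), vAR_Features_test (test features) and the fitted vAR_model from the main implementation:

from sklearn.metrics import silhouette_score

# Silhouette on the data the model was trained on
print(silhouette_score(vAR_df4, vAR_model.labels_))

# Silhouette on unseen data, using the clusters assigned by the trained model
print(silhouette_score(vAR_Features_test, vAR_model.predict(vAR_Features_test)))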

Training Data Set (Example-1)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Training Data Set

Test Data Set (Example-1)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Test Data Set

Best Fitting Model Code Block (Example-1)

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Bestfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the best-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Best_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='r')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Best_Fit_Image_Example_1)
    print(Verify_Model_Bestfit())

Best Fitting Model Plotted (Example-1)

Best Fitting Model Plotted

Training Data Set (Example-2)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Training Data Set

Test Data Set (Example-2)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Test Data Set

Best Fitting Model Code Block (Example-2)

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Bestfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the best-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Best_Fit_File_Example_2, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='b')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Best_Fit_Image_Example_2)
    print(Verify_Model_Bestfit())

Best Fitting Model Plotted (Example-2)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Test Data Set (Example-3)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Best Fitting Model Code Block (Example-3)

Best Fitting Model Code Block

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Bestfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the best-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Best_Fit_File_Example_3, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='g')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Best_Fit_Image_Example_3)
    print(Verify_Model_Bestfit())

Best Fitting Model Plotted (Example-3)

Over Fitting

The model is overfitting when it performs well on the training examples but does not perform well on unseen data. It is often the result of an excessively complex model. It happens because the model memorizes the relationship between the input examples (often called X) and the target variable (often called y) and so is unable to generalize to new data. An overfitting model predicts the targets in the training data set very accurately.

Training Data Set (Example-1)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Test Data Set (Example-1)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Over Fitting Model Code Block (Example-1)

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Overfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the over-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Over_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='r')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Over_Fit_Image_Example_1)
    print(Verify_Model_Overfit())

Over Fitting Model Plotted (Example-1)

Training Data Set (Example-2)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Test Data Set (Example-2)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Over Fitting Model Code Block (Example-2)

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Overfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the over-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Over_Fit_File_Example_2, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='c')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Over_Fit_Image_Example_2)
    print(Verify_Model_Overfit())

Over Fitting Model Plotted (Example-2)

Training Data Set (Example-3)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Test Data Set (Example-3)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Over Fitting Model Code Block (Example-3)

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Overfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the over-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Over_Fit_File_Example_3, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='b')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Over_Fit_Image_Example_3)
    print(Verify_Model_Overfit())

Over Fitting Model Plotted (Example-3)

Under Fitting

The predictive model is said to be underfitting if it performs poorly on the training data. This happens because the model is unable to capture the relationship between the input examples and the target variable, often because the model is too simple, i.e., the input features are not expressive enough to describe the target variable well. An underfitting model does not predict the targets in the training data set very accurately. Underfitting can be addressed by using more data and by improving the features through feature engineering and selection.

Training Data Set (Example-1)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically.

Test Data Set (Example-1)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Under Fitting Model Code Block (Example-1)

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Underfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the under-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='r')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Under_Fit_Image_Example_1)
    print(Verify_Model_Underfit())

Under Fitting Model Plotted (Example-1)

Training Data Set (Example-2)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically.

Test Data Set (Example-2)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Under Fitting Model Code Block (Example-2)

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Underfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the under-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_2, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='b')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Under_Fit_Image_Example_2)
    print(Verify_Model_Underfit())

Under Fitting Model Plotted (Example-2)

Training Data Set (Example-3): The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically.

Test Data Set (Example-3)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Under Fitting Model Code Block (Example-3)

if vAR_Fetched_Data_Model_Fitting_Best_Fit_Test == 'Y':
    def Verify_Model_Underfit():
        import matplotlib.pyplot as vAR_plt
        # Merge the under-fit example data with the predicted cluster labels
        vAR_df8 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_3, 'r', encoding='utf-8'))
        vAR_df9 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        # Plot the company against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='g')
        vAR_plt.xlabel('Company')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        vAR_plt.savefig(vAR_Fetched_Data_Under_Fit_Image_Example_3)
    print(Verify_Model_Underfit())

Under Fitting Model Plotted (Example-3)

Cross Validation

Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it using the complementary subset of the dataset.
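
A minimal sketch of that idea using KFold from scikit-learn: fit K-Means on one subset, then score the complementary held-out subset with the silhouette score (vAR_df4 is assumed to be the numeric training features from Step 3, and each held-out fold is assumed to receive points from at least two clusters):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold

vAR_kf = KFold(n_splits=3, shuffle=True, random_state=0)
for vAR_train_idx, vAR_test_idx in vAR_kf.split(vAR_df4):
    # Train on one subset of the data
    vAR_km = KMeans(n_clusters=7, random_state=0).fit(vAR_df4.iloc[vAR_train_idx])
    # Evaluate on the complementary subset the model has not seen
    vAR_labels = vAR_km.predict(vAR_df4.iloc[vAR_test_idx])
    print(silhouette_score(vAR_df4.iloc[vAR_test_idx], vAR_labels))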

Training Data Set (Example-1)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Training Data Set

Test Data Set (Example-1)

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Test Data Set
Cross Validation Model Code Block (Example-1)

if vAR_Fetched_Data_Cross_Validation_Required == 'Y':
    from sklearn.model_selection import cross_val_predict
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as vAR_plt
    # Fit K-Means on the training features and cross-validate the cluster assignments
    vAR_model = KMeans(n_clusters=7, random_state=0)
    vAR_model.fit(vAR_df4)
    vAR_Predicted = cross_val_predict(vAR_model, vAR_df4, cv=2)
    vAR_fig, vAR_ax = vAR_plt.subplots()
    vAR_Labels_Pred = vAR_model.labels_
    # Compare the labels from the full fit with the cross-validated predictions
    vAR_ax.scatter(vAR_Labels_Pred, vAR_Predicted, edgecolors=(0, 0, 0))
    vAR_ax.plot([vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], [vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], 'k--', lw=4)
    vAR_ax.set_xlabel('Actual Difference in Booking')
    vAR_ax.set_ylabel('Predicted Difference in Booking')
    #vAR_plt.show()
    vAR_plt.savefig(vAR_Fetched_Data_Cross_Validation_Image_Example_1)

Cross Validation Model Plotted (Example-1)

Cross Validation Model Plotted

Training Data Set (Example-2)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Training Data Set

Test Data Set (Example-2): Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Test Data Set
Cross Validation Model Code Block (Example-2)

if vAR_Fetched_Data_Cross_Validation_Required == 'Y':
    from sklearn.model_selection import cross_val_predict
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as vAR_plt
    # Fit K-Means on the training features and cross-validate the cluster assignments
    vAR_model = KMeans(n_clusters=7, random_state=0)
    vAR_model.fit(vAR_df4)
    vAR_Predicted = cross_val_predict(vAR_model, vAR_df4, cv=4)
    vAR_fig, vAR_ax = vAR_plt.subplots()
    vAR_Labels_Pred = vAR_model.labels_
    # Compare the first 15 labels from the full fit with the cross-validated predictions
    vAR_ax.scatter(vAR_Labels_Pred[:15], vAR_Predicted[:15], edgecolors=(0, 0, 0))
    vAR_ax.plot([vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], [vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], 'k--', lw=4)
    vAR_ax.set_xlabel('Actual Difference in Booking')
    vAR_ax.set_ylabel('Predicted Difference in Booking')
    #vAR_plt.show()
    vAR_plt.savefig(vAR_Fetched_Data_Cross_Validation_Image_Example_2)

Cross Validation Model Plotted (Example-2)

Cross Validation Model Plotted

Training Data Set (Example-3)

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Training Data Set

Test Data Set (Example-3): Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Test Data Set

Cross Validation Model Code Block (Example-3)

if vAR_Fetched_Data_Cross_Validation_Required == 'Y':
    from sklearn.model_selection import cross_val_predict
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as vAR_plt
    # Fit K-Means on the training features and cross-validate the cluster assignments
    vAR_model = KMeans(n_clusters=7, random_state=0)
    vAR_model.fit(vAR_df4)
    vAR_Predicted = cross_val_predict(vAR_model, vAR_df4, cv=4)
    vAR_fig, vAR_ax = vAR_plt.subplots()
    vAR_Labels_Pred = vAR_model.labels_
    # Compare the first 10 labels from the full fit with the cross-validated predictions
    vAR_ax.scatter(vAR_Labels_Pred[:10], vAR_Predicted[:10], edgecolors=(0, 0, 0))
    vAR_ax.plot([vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], [vAR_Labels_Pred.min(), vAR_Labels_Pred.max()], 'k--', lw=4)
    vAR_ax.set_xlabel('Actual Difference in Booking')
    vAR_ax.set_ylabel('Predicted Difference in Booking')
    #vAR_plt.show()
    vAR_plt.savefig(vAR_Fetched_Data_Cross_Validation_Image_Example_3)

Cross Validation Model Plotted (Example-3)

Cross Validation Model Plotted

Hyperparameter Tuning

Hyperparameter Optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.
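
For K-Means the main hyperparameter is n_clusters. A hedged sketch of a simple search, scoring each candidate cluster count with the silhouette score on the training features (vAR_df4 from Step 3 is assumed; the candidate range is illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

vAR_best_k, vAR_best_score = None, -1.0
for vAR_k in range(2, 13):
    vAR_km = KMeans(n_clusters=vAR_k, random_state=0).fit(vAR_df4)
    vAR_score = silhouette_score(vAR_df4, vAR_km.labels_)
    print(vAR_k, round(vAR_score, 3))
    if vAR_score > vAR_best_score:
        vAR_best_k, vAR_best_score = vAR_k, vAR_score
print('Selected n_clusters:', vAR_best_k)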

Training Data Set

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Training Data Set

Test Data Set

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Hyperparameter Tuning

Hyperparameter Tuning Code Block Before Tuning

if vAR_Fetched_Data_Hyperparameter_Tuning_Required == 'Y':
    def Before_Hyperparameter_Tuning():
        from sklearn.cluster import KMeans
        # Fit K-Means with the pre-tuning cluster count
        vAR_model = KMeans(n_clusters=6, random_state=0)
        vAR_model.fit(vAR_df4)
        vAR_df6 = vAR_pd.read_excel(vAR_Fetched_Data_Source_Path_Test_Data)
        # Predict clusters for the test features and attach them to the example data
        vAR_Features_Test = vAR_pd.read_excel(vAR_Fetched_Data_Test_All_Features)
        vAR_Labels_Pred = vAR_model.predict(vAR_Features_Test).astype(int)
        vAR_Labels_Pred = vAR_pd.DataFrame(vAR_Labels_Pred, columns=['Predicted_Inter_Transaction_Pair_Difference_Booking'])
        vAR_df7 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df8 = vAR_df7.iloc[:,:-1]
        vAR_df10 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        import matplotlib.pyplot as vAR_plt
        # Plot the company code against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='r')
        vAR_plt.xlabel('Company Code')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        #vAR_plt.savefig(vAR_Fetched_Data_Before_Hyperparameter_Tuning_Image)
    print(Before_Hyperparameter_Tuning())

Hyperparameter Tuning Plotted Before Tuning

Hyperparameter Tuning Plotted Before Tuning

Training Data Set

The training data set is the actual dataset used to train the model for performing various Machine Learning Operations (Regression, Classification, Clustering etc.). This is the actual data with which the models learn with various API and algorithm to train the machine to work automatically

Training Data Set

Test Data Set

Test data set helps you to validate that the training has happened efficiently in terms of either accuracy, or precision so on. Actually, such data is used for testing the model whether it is responding or working appropriately or not.

Test Data Set

Hyperparameter Tuning Code Block After Tuning

if vAR_Fetched_Data_Hyperparameter_Tuning_Required == 'Y':
    def After_Hyperparameter_Tuning():
        from sklearn.cluster import KMeans
        # Fit K-Means with the tuned cluster count
        vAR_model = KMeans(n_clusters=12, random_state=0)
        vAR_model.fit(vAR_df4)
        vAR_df6 = vAR_pd.read_excel(vAR_Fetched_Data_Source_Path_Test_Data)
        # Predict clusters for the test features and attach them to the example data
        vAR_Features_Test = vAR_pd.read_excel(vAR_Fetched_Data_Test_All_Features)
        vAR_Labels_Pred = vAR_model.predict(vAR_Features_Test).astype(int)
        vAR_Labels_Pred = vAR_pd.DataFrame(vAR_Labels_Pred, columns=['Predicted_Inter_Transaction_Pair_Difference_Booking'])
        vAR_df7 = vAR_pd.read_csv(open(vAR_Fetched_Data_Under_Fit_File_Example_1, 'r', encoding='utf-8'))
        vAR_df8 = vAR_df7.iloc[:,:-1]
        vAR_df10 = vAR_df8.merge(vAR_Labels_Pred, left_index=True, right_index=True)
        import matplotlib.pyplot as vAR_plt
        # Plot the company code against the predicted intercompany transaction pair
        vAR_plt.scatter(vAR_df9.iloc[:,0], vAR_df9.iloc[:,13], s=100, c='r')
        vAR_plt.xlabel('Company Code')
        vAR_plt.ylabel('Predicted_Inter_Transaction_Pair_Difference_Booking')
        #vAR_plt.show()
        #vAR_plt.savefig(vAR_Fetched_Data_Before_Hyperparameter_Tuning_Image)
    print(After_Hyperparameter_Tuning())

Hyperparameter Tuning Plotted After Tuning

Hyperparameter Tuning Plotted After Tuning

Content Developer

Our team comprises MIT learning facilitators, Harvard PhDs, Stanford alumni, leading management consulting experts, industry leaders and proven entrepreneurs. Collectively, our team brings business and technology together with risk-free implementation of artificial intelligence for the enterprise.

Customers’ Vocal Endorsements

We have been delivering impactful products and services on artificial intelligence, data engineering, finance, analytics, training and talent development for every business function. We work closely with senior executives as well as technical developers.


Contact

Point of Contact

Jothi Periasamy
Chief AI Architect


Address

2100 Geng Road
Suite 210
Palo Alto
CA 94303


Contact e-Mail

Info@DeepSphere.AI


Contact Phone

(916)-296-0228


Web

https://www.deepsphere.ai
