Data Science 101: Introduction to Orange Tool Part-2

Darshil Patel
6 min read · Oct 19, 2021

Visual programming with the Orange tool, part 2: How do we split our data into training and testing sets in Orange? What effect does splitting the data have on the classification model and its results? How do we use cross-validation efficiently in Orange, and what effect does it have on the model's output and accuracy?

Prerequisite:

Introduction to Orange Tool Part-1

Splitting Data Into Training Data And Testing Data

For splitting the dataset, we use the Data Sampler widget. We will split the data into two parts: 80% for training and 20% for testing. The first 80% is sent onward to build the model; the remaining 20% is kept for testing.

First, connect the File widget to the Data Sampler. Then click on the Data Sampler and adjust its settings as needed.

Data Sampler

It selects a subset of data instances from an input dataset.

Inputs

  • Data: input dataset

Outputs

  • Data Sample: sampled data instances (used for training)
  • Remaining Data: out-of-sample data (used for testing)

So we pass the whole dataset into the Data Sampler widget. By default, the Data Sampler splits the dataset into a 70% sample and 30% remaining data; here we change the fixed proportion so that 80% goes to training and 20% to testing.
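If you prefer scripting, Orange3 also exposes a Python API. Below is a minimal sketch of the same fixed-proportion split; the built-in "iris" dataset and the fixed random seed are assumptions for illustration, and Table indexing details can vary slightly between Orange versions.

```python
# A minimal sketch of an 80/20 split with Orange3's Python API.
# The "iris" dataset and the seed are illustrative assumptions.
import numpy as np
import Orange

data = Orange.data.Table("iris")

rng = np.random.default_rng(42)       # fixed seed so the split is reproducible
indices = rng.permutation(len(data))  # shuffled row indices
cut = int(0.8 * len(data))            # 80% boundary

train = data[indices[:cut]]           # "Data Sample" output (80%)
test = data[indices[cut:]]            # "Remaining Data" output (20%)
print(len(train), len(test))          # 120 30 for iris
```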

Effects Of Data Splitting On The Classification Model & Result

To see the effect, we first build a workflow that tests learning algorithms (SVM, kNN, and so on) on the data and scores them. For this we use the Test and Score widget, which takes the data from the Data Sampler, then trains, tests, and scores the learning algorithms.

Generated Workflow

Test and Score

Tests learning algorithms on data.

Inputs

  • Data: input dataset
  • Test Data: separate data for testing
  • Learner: learning algorithm(s)

Outputs

  • Evaluation Results: results of testing classification algorithms.

After splitting the data, we connect the Data Sampler to the Test and Score widget with two links: one for the training data and one for the test data. Double-clicking the link opens the Edit Links dialog, where we set the links as shown below.

Link edit

Data Sample (80%) -> Data (training data)

Remaining Data (20%) -> Test Data

This workflow uses the Naive Bayes, Random Forest, Neural Network, and kNN (k-Nearest Neighbors) widgets to create the models. Each of these widgets wraps a machine learning method. Connect each of them to the Test and Score widget as a Learner input, as done in the generated workflow (see the images above).

As described in the Test and Score section above, the widget needs two kinds of input to test and score:

(1) Data (train and test)

(2) A machine learning algorithm (learner)

After sending the models to Test and Score along with the train and test samples, we can observe their performance in the table inside the widget. Before reading the evaluation results, make sure the widget evaluates on the test samples by selecting the Test on test data option in the left panel, as shown in the figure below. Other evaluation options are available, such as cross validation and leave one out; when we have a separate test set, we should always evaluate the model on the test data.
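For reference, here is a rough scripting equivalent of the Test on test data mode, reusing the train and test tables from the split sketch above. The validation-class call signature shown is from recent Orange3 releases and may differ in older versions; only three of the four learners are shown for brevity.

```python
import Orange

# Learner classes mirroring the widgets in the workflow above
learners = [
    Orange.classification.NaiveBayesLearner(),
    Orange.classification.RandomForestLearner(),
    Orange.classification.KNNLearner(),
]

# Train on `train`, evaluate on `test` -- the "Test on test data" mode
results = Orange.evaluation.TestOnTestData()(
    data=train, test_data=test, learners=learners
)

for learner, ca in zip(learners, Orange.evaluation.CA(results)):
    print(f"{learner.name}: CA = {ca:.3f}")
```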

Test and Score on Test Data

Importance Of Separation Of Test And Training Data

The main reason for separating the data is evaluation. Overfitting is a common problem when training a model: it occurs when a model performs exceptionally well on the data it was trained on but fails to generalize to new, previously unseen data points.

The test data acts as those new, previously unseen data points, so when the model is evaluated on the test data we learn its actual accuracy. If we instead evaluate the model on the training data, it reports higher accuracy than on the test data, because the model has already been trained on the very examples used for evaluation. Such models do not generalize to real-world data; they merely overfit the training set.

Test and Score on Training Data
Test and Score on Test Data

So the effect of splitting the data on the classification model shows up in the CA (classification accuracy). Here we can see that the CA for Test on train data (left side) is higher, but as discussed, that is not the model's actual accuracy; what we really want is a model that generalizes to unseen test data.
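A quick way to see this gap in a script is to score the same learners both on the training data and on the held-out test data; again a sketch, reusing the names from the previous snippets.

```python
import Orange

# Scoring on the training data itself ("Test on train data")...
on_train = Orange.evaluation.TestOnTrainingData()(data=train, learners=learners)
# ...versus scoring on the held-out 20% ("Test on test data")
on_test = Orange.evaluation.TestOnTestData()(
    data=train, test_data=test, learners=learners
)

print("CA on training data:", Orange.evaluation.CA(on_train))
print("CA on test data:    ", Orange.evaluation.CA(on_test))
```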

Cross Validation

Cross-validation is a statistical method for estimating the performance (or accuracy) of a machine learning model. It is used to guard against overfitting in a predictive model, especially when the amount of available data is limited. In cross-validation, the data is partitioned into a set number of folds, the analysis is conducted on each fold, and the error estimates are averaged over the folds.

Splitting data into train and test sets, as we did above, is itself a type of validation called holdout validation. One way to improve on the holdout method is k-fold cross validation. This strategy ensures that the model's score does not depend on how we happened to choose the train and test sets: the dataset is divided into k subsets, and the holdout method is repeated k times, each time using a different subset as the test set and the remaining k-1 subsets as the training set.

Efficient Use Of Cross-Validation In Orange

For the same workflow, we can use cross-validation in the Test and Score widget by selecting the Cross validation option in the left panel, as shown in the image below. We can also change the number of folds k.

Cross-Validation

Here we use k = 10, so the number of folds is 10: the dataset is divided into 10 subsets, and each subset in turn is held out for testing while the model trains on the other 9.
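In scripting terms, the same 10-fold evaluation looks roughly like the sketch below; note that older Orange3 releases accept CrossValidation(data, learners, k=10) in a single call instead.

```python
import Orange

# 10-fold cross-validation over the full dataset; no manual split is needed
cv = Orange.evaluation.CrossValidation(k=10)
results = cv(data, learners)

for learner, ca in zip(learners, Orange.evaluation.CA(results)):
    print(f"{learner.name}: CA = {ca:.3f}")
```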

For more details on cross-validation, read this article.

Effects On Model Output & Accuracy

Cross-validation vs Holdout validation

Cross-validation is a method of evaluating how well a machine learning model predicts fresh data. It can also detect issues such as overfitting and selection bias, and it indicates how the model will generalize to an independent dataset. Instead of a single holdout split, the evaluation is performed k times, which gives a better estimate of the model's actual accuracy. So although the cross-validation accuracy is lower, it is a more realistic, better-generalizing estimate.

Confusion Matrices can also be used for analyzing the output.

Confusion Matrix for kNN
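As a sketch, a confusion matrix can also be computed from the evaluation results in a script. This assumes the results.actual / results.predicted attributes of Orange's Results object (true and per-learner predicted class indices) and reuses the cross-validation results and learner list from above, where index 2 is kNN.

```python
from sklearn.metrics import confusion_matrix

# `results.actual` holds the true class indices; `results.predicted` has one
# row of predicted class indices per learner. Index 2 is kNN in our list.
cm = confusion_matrix(results.actual, results.predicted[2])
print(cm)
```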

Conclusion:

So, there it is. This is how you can perform basic visual programming with Orange3.

Do check out more features of the Orange tool here.
