# DP-100 Microsoft Data Science – Feature Selection – Select a subset of Variables or features with highest impact

**Feature Selection – Section Introduction**

Hi. In this lecture, let’s see what feature selection is. Let’s say we want to predict whether a customer’s loan would be approved or not. This might sound like an easy problem based on whatever we have learned so far. However, you might be surprised by the kind of data that may be provided to you. There can be hundreds of variables or features in a dataset. Personal information may include name, gender, marital status, age, etc. There will be financial as well as demographic information such as income levels, credit history and so on. The data from the loan provider may include what type of loan product is being sold, such as a car or house loan, as well as whether she is an existing customer from a group such as high net worth, et cetera. You may even have details provided through the application, such as education details and employment information.

You may even have a co-applicant, who again will have all or some of this information. So do we include all these features? We may be tempted to, but remember, our model is only as good as the data that we provide. Remember the saying, garbage in, garbage out? That is so true with machine learning as well. That’s why selecting the best features, the ones which explain the behavior or pattern in the data, is so important. Let’s see the benefits of using feature selection. It makes it so much easier to interpret and explain the results to someone else, because with fewer features you can explain the impact of choosing one variable over another.

The training time is also lower, and we get rid of irrelevant features or noise within the data, and as a result our accuracy can improve. This is especially true for data sets with a high number of variables. We have seen the curse of dimensionality when we have to deal with a large number of features, and feature selection helps us avoid it. Last but not least, we want to see a general pattern within the data and not an equation that covers every point. Selecting the most relevant features enhances this generalization and reduces overfitting. Well, there are three main methods of feature selection: filter-based methods, wrapper methods and embedded ones. Filter-based methods use the correlation among various variables or features and select the most relevant ones. So basically you may have a full set of features.

They are then passed through various statistical tests such as Pearson correlation, chi-square, mutual information, Fisher score, et cetera, in relation to the predicted variable or the Y variable, and depending upon the score of the tests, we either select or reject a particular feature. All right, we will go through filter-based feature selection in the next few lectures and we’ll also perform a lab on the same. The next is the wrapper method. Pay close attention to this, as it could be slightly tricky to understand. In case of wrapper methods, we choose different features from the available ones. We then run the algorithm and measure its performance.

We then repeat this process for multiple subsets of the input features and find the subset that provides us the best performance. The performance measures may be different for different types of algorithms. For example, in case of classification we may be interested in the best accuracy or precision, while in case of a regression problem we may be interested in getting the lowest root mean square error. There are various methods for choosing and discarding these features. Forward selection starts with only one variable and keeps on adding more with every iteration. It stops when the model performance does not improve beyond a point. In case of backward elimination, we start with all features and remove a feature with every iteration until doing so no longer gives a significant improvement in the model performance.

And in case of recursive feature elimination, the model is rebuilt repeatedly, and at every iteration the best feature is selected or the worst one is removed. I hope that provides you the intuition behind the feature selection methods. In case of Azure ML, we have three methods of feature selection: filter-based, which makes use of various statistical techniques to find the correlation; Fisher LDA; and permutation feature importance, which is similar to wrapper methods. We have dedicated lectures for all three in this particular section, so let’s jump into the next lecture and learn one of these methods. Thank you so much for joining me in this one.
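
The forward selection idea described above can be sketched in a few lines of Python. This is only an illustration and not how any Azure ML module works: `forward_selection`, `toy_score` and the `USEFULNESS` numbers are hypothetical, and the scoring function simply stands in for "train the model on this subset and measure its performance".

```python
from typing import Callable, List, Sequence


def forward_selection(
    features: Sequence[str],
    score_fn: Callable[[List[str]], float],
    min_gain: float = 1e-9,
) -> List[str]:
    """Greedy forward selection: keep adding the feature that helps the most."""
    selected: List[str] = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining:
        # Score every candidate subset formed by adding one more feature.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored)
        # Stop once the best candidate no longer improves the score.
        if top_score - best_score <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# ASSUMPTION: made-up usefulness values; in practice score_fn would train
# and evaluate a real model on the given subset of features.
USEFULNESS = {"income": 0.30, "credit_history": 0.40, "name": 0.0, "age": 0.05}


def toy_score(subset: List[str]) -> float:
    # A small penalty per feature mimics the cost of carrying noisy inputs.
    return sum(USEFULNESS[f] for f in subset) - 0.02 * len(subset)


chosen = forward_selection(list(USEFULNESS), toy_score)
print(chosen)
```

Backward elimination would follow the same pattern in reverse: start with every feature in `selected` and drop the one whose removal hurts the score the least.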

**Pearson Correlation Coefficient**

Hi. In the previous lecture we learnt what feature selection is, along with its benefits and importance. In this lecture, we will go through one of the first methods of filter-based feature selection. Filter-based feature selection has multiple feature scoring methods. We will go through some of these in this particular section. In this lecture, we are going to cover the Pearson correlation coefficient. It’s also denoted by the letter r. This is the formula for calculating r. We will see how we use this in just a few seconds. It simply measures the linear correlation between two variables and can range from minus one to plus one. As we saw during the initial few lectures, two variables x and y are said to have a positive correlation if the value of y increases with x, and a negative correlation if y decreases as x increases.

The value of r will be positive in case of a positive correlation, and less than zero if there is a negative correlation. All right, let’s see how we can do feature selection using r. We have this data set that predicts the price of a vehicle based on various parameters. In the sample that we have chosen, engine size and the horsepower that it produces are the two variables. So as per the formula, we calculate the standard deviation of all the columns, including price, which is our predicted variable. Next, based on this formula, we calculate the value of r.

You can try it out, but it really is not needed. You can simply follow me and things will be clear. Similarly, we also find the value of r for horsepower. Please note that these values are being calculated in relation to our predicted variable of price. So now when we compare the values of r, we see that engine size has a higher value of correlation coefficient, which means it explains variation in price better than the horsepower. So if we had to choose between these two variables, we will go with engine size. This process is repeated for all the eligible variables or features, and depending upon how many variables we need, we can select those based on their Pearson correlation coefficient. I hope that explains what is Pearson correlation coefficient. It has got some distinct advantages.

It not only identifies the type of correlation, but also the degree and extent of it. And it’s among the easiest to interpret as well as to explain. However, it also brings in some disadvantages with it. The value of r is affected by the presence of outliers. So if we have huge amount of outliers or discrepancy in the observations, then the value of r might get affected because of that. Also, it’s not good for nonlinear relationship among the variables.

All right. Also, it is applicable only to continuous variables, and if we do not also apply it across the features themselves, we may select variables that are strongly related to each other. In this example, engine size and horsepower have a direct causal relationship. The value of HP increases with the value of engine size, and when we are selecting these from a large set of variables or features, we may want to find out such causal relationships before selecting them for model building. I hope that explains how the Pearson correlation coefficient works. I will see you in the next lecture with another feature selection technique.
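
To make the calculation concrete, here is a minimal Python sketch of the Pearson coefficient, written as the covariance of the two columns divided by the product of their spreads. The `engine_size`, `horsepower` and `price` values are made up for illustration and are not the numbers shown on the slide.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: co-variation of x and y over the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return co_variation / math.sqrt(spread_x * spread_y)


# ASSUMPTION: hypothetical sample rows, not the lecture's data set.
engine_size = [97, 108, 120, 130, 152]
horsepower = [69, 101, 97, 111, 154]
price = [7000, 16000, 11000, 13500, 18000]

# Each feature is scored against the predicted variable (price).
scores = {
    "engine_size": pearson_r(engine_size, price),
    "horsepower": pearson_r(horsepower, price),
}
for name, r in sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{name}: r = {r:.3f}")
```

Scoring one feature against another, for example `pearson_r(engine_size, horsepower)`, is one way to spot the kind of strongly related features mentioned above.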

**Chi Square Test of Independence**

Hello and welcome to this lecture on filter-based feature selection. Today we are going to cover the chi-square feature selection method. It was developed by Karl Pearson, who also gave us the Pearson correlation coefficient for continuous variables. Chi-square, on the other hand, gives us the relationship or association between categorical variables. When we apply it to two categorical variables, it is also known as the test of independence, and it checks whether the two variables or features are related. Let’s see in brief how chi-square works and finds out the association between two categorical variables. It follows certain steps and I’m going to walk you through some of these steps. Let’s say we have this data set which has got two variables.

One is the flight status and the second is the weather. We want to find out if there is any relationship between the variable weather and the flight status. So as the first step we start with a null hypothesis. What it means is that there is no relationship, which also means that the alternate hypothesis is that there is a relationship. Okay, next we define alpha, which is nothing but a cutoff probability, also called the significance level. By default it is set to 5%, which means we accept at most a 5% chance of rejecting the null hypothesis of no relationship when it is actually true. If our result would be less likely than that under the null hypothesis, we conclude that there is a relationship. In the next step it builds a frequency table of the two categorical variables for their values.

So in this case we build a matrix of how many times the flight was delayed when it was rainy, sunny or overcast. We do the same for on-time performance. From this we calculate something known as the degrees of freedom, which is nothing but the product of the number of rows minus one and the number of columns minus one. With two rows and three columns, that gives us a value of two. In the next step we formulate and state our decision rule. This is done from the chi-square table, something similar to the log tables we are familiar with. This is freely available on the internet and is used while performing this particular test.

So we then look for a value of chi-square, or χ², that corresponds to our degrees of freedom and our cutoff probability. In this case it is 5.991. So we say that we will reject the null hypothesis if our χ² value is more than 5.991.

All right, so stay with me as we find out this value for our data set. Before we calculate the chi-square value, we need to find out what the expected values for this matrix should be, based on all the occurrences and distributing them according to the overall proportions. Okay, so in total the flight was delayed 65 times, and it was raining 47 times. This happened out of a total of 200 observations. So the expected value for the delayed flights when it was raining should be 65 multiplied by 47 divided by the total number of occurrences, so the expected frequency for this particular cell will be about 15 (15.275 before rounding), as we calculated here. We similarly calculate the values for the other cells by multiplying the respective row and column totals and dividing by the total number of observations. In the subsequent step, we calculate the chi-square contribution of each cell using this formula, the squared difference between the observed and expected values divided by the expected value, and sum them up. It gives us a value of 55.6. So for these two columns of weather and flight status, our chi-square value is 55.6. So to draw the conclusion, because our chi-square or χ² value is greater than 5.991, we reject the null hypothesis and state that flight status and weather are related. I hope that explains how chi-square tests the independence between two variables. And once a relationship has been established, it’s easy to choose the feature for model building. I suggest you go through the video and revise these steps once again, as it could be difficult to get it in the first attempt. Thank you so much for joining me here. I will see you in the next one with another feature selection technique.
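
The steps of the test can be sketched in plain Python as follows. Only the totals (65 delayed flights, 47 rainy days, 200 observations) and the critical value 5.991 come from the lecture; the individual cell counts in `observed` are invented so that the totals match, which means the statistic printed here will not equal the 55.6 from the slide.

```python
from typing import List, Sequence

Table = Sequence[Sequence[float]]


def expected_counts(table: Table) -> List[List[float]]:
    """Expected cell count = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table: Table) -> float:
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


def degrees_of_freedom(table: Table) -> int:
    return (len(table) - 1) * (len(table[0]) - 1)


# Columns: rainy, sunny, overcast.
# ASSUMPTION: cell counts are hypothetical; only the totals follow the lecture.
observed = [
    [35, 10, 20],  # delayed, row total 65
    [12, 83, 40],  # on time, row total 135
]
CRITICAL_VALUE = 5.991  # chi-square table entry for 2 degrees of freedom, alpha 5%

expected = expected_counts(observed)
statistic = chi_square(observed)
dof = degrees_of_freedom(observed)
reject_null = statistic > CRITICAL_VALUE

print(f"expected delayed-and-rainy count: {expected[0][0]:.3f}")
print(f"degrees of freedom: {dof}")
print(f"chi-square: {statistic:.2f}, reject null hypothesis: {reject_null}")
```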

**Kendall Correlation Coefficient**

Hi. Through this series of lectures in this section we have learnt various feature selection methods such as Pearson correlation and chi-square, and we will learn Kendall rank correlation in this lecture. Currently, I’m simply focusing on providing you the intuition behind all of these feature selection techniques. We will surely have a practical lab to demonstrate these methods after a few lectures. We will also see which one is better in certain scenarios. So let’s see what Kendall rank correlation is. Well, it’s named after Maurice Kendall and is a measure of rank correlation. Rank correlation simply means that when two variables are ranked, a change in one shows a similar positive or negative change in the ranking of the other when we measure it across two different points.

All right, let’s see what we mean by that using this data set for vehicle price prediction. Let’s see how Kendall correlation helps us identify the relationship between two variables: engine size and our predicted variable, vehicle price. So we have sorted these observations on engine size, and the next step is to find a pair and analyze the effect of a change in x, the predictor feature, which in this case is the engine size. So we have Xi as 108 for this particular point and Yi as 16,430. We then choose another point j and measure Xj and Yj. Because we have ranked or sorted the data set on engine size, Xj is higher than Xi, and Yj is also higher than Yi.

In this case, we call such a pair a concordant pair. We also have a pair such as this one where, even though the value of x or engine size has increased, the price has actually gone down in comparison to Xi and Yi. Such a pair is called a discordant pair. We then compare all the observations and find out all the concordant and discordant pairs. The Kendall correlation coefficient is then calculated using this formula. The maximum number of pairs you can build in any data set will always be n multiplied by n minus one, divided by two. All right, what it tells us is that if y always increases with an increasing value of x, we will have this coefficient as one.

So there is a strong positive correlation between x and y. Similarly, if y always decreases with an increasing value of x, all the pairs will be discordant pairs and hence this coefficient will be minus one. If x and y are independent, this value will be close to zero. And when we compare two features against the predicted feature, the one with the higher value of the Kendall correlation coefficient will be selected as the one with higher predictive ability. All right, in the next lecture we will see Spearman’s correlation coefficient before we compare these and also perform an experiment to select the features from a dataset. Thank you so much for joining me here. I will see you in the next lecture with Spearman’s correlation coefficient.
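
The pair counting above can be sketched in Python as below. I am assuming the simplest form of the coefficient, (concordant minus discordant) divided by n(n - 1)/2, with tied pairs counted as neither; the slide’s formula may treat ties differently. Apart from the first point (108, 16430), the sample rows are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant - discordant) / total pairs; tied pairs count as neither."""
    concordant = 0
    discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        direction = (xj - xi) * (yj - yi)
        if direction > 0:
            concordant += 1  # x and y moved in the same direction
        elif direction < 0:
            discordant += 1  # x went up while y went down, or vice versa
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# ASSUMPTION: only the first row (108, 16430) is taken from the lecture.
engine_size = [108, 120, 130, 141, 152]
price = [16430, 15000, 18500, 17450, 21000]

tau = kendall_tau(engine_size, price)
print(f"Kendall tau for engine size vs price: {tau:.2f}")
```

Because every pair of rows is compared, the loop does n(n - 1)/2 comparisons, which is the maximum number of pairs mentioned in the lecture.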

**Spearman’s Rank Correlation**

Hi. So far we have seen the Pearson, chi-square and Kendall techniques of filter-based feature selection. In this lecture, we will learn about Spearman’s rank correlation coefficient. It’s named after Charles Spearman and is a measure of rank correlation between two ordinal variables. The main difference here is that it measures the correlation using a monotonic function. So, for example, we have these values of x and y where y is a nonlinear function of x. Though we can see that the value of y increases with x and there is a direct relationship between x and y, Pearson correlation may give us misleading results as it measures the linear relationship, whereas Spearman correlation measures how well Y can be described as a monotonic function of X.

And hence it can be used for a nonlinear relationship such as this one. It’s denoted by the Greek symbol rho (ρ). And this is the formula for calculating rho. Let’s calculate it for the engine size and the horsepower. We will then compare them and select the feature that has the greatest ability to predict the price of the vehicle. So, the first step is to calculate the rank of all the observations for the feature engine size. And this is how it looks. In the next step, we do the same for the column price. But remember, we simply assign the ranks without changing the order of the feature values.

So this is how it will look. And because we have ranked them in ascending order, this value 13,495 takes the rank of one and 24,565 takes the rank of 13. In the third step, we calculate the difference of these ranks and their squared values, as you can see in these two columns. For now, just follow me. These calculations may not be required in a real-life scenario; I’m simply explaining them so that you have a better understanding of how it works. So, the only thing that remains now is to calculate the value of Spearman’s correlation coefficient, which is rho.

So, using this formula, we calculate the value of rho as 0.6 for engine size in correlation to the price of the vehicle. All right, we do the same for the horsepower of the vehicle, and it gives us a value of 0.49 in correlation to vehicle price. And because engine size has the higher value of rho, we conclude that engine size has more predictive power for the vehicle price than the horsepower of the vehicle. I hope that explains how we calculate Spearman’s correlation coefficient for feature selection. In the next lecture, we will compare the various correlation coefficients and see in which scenarios we should use a particular one. Thank you so much for joining me in this lecture. See you soon.
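
The three steps (rank both columns, take the squared rank differences, apply the formula) can be sketched in Python as below. I am assuming the common textbook form rho = 1 - 6 Σd² / (n(n² - 1)), which is only valid when there are no tied values; the sample columns are hypothetical apart from the two prices quoted in the lecture.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 goes to the smallest value; the original row order is kept."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)); ASSUMES no tied values."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# ASSUMPTION: made-up rows; only 13495 and 24565 are prices from the lecture.
engine_size = [130, 152, 109, 136, 164]
price = [13495, 16500, 13950, 17450, 24565]

print("engine size ranks:", ranks(engine_size))
print("price ranks:      ", ranks(price))
print(f"rho = {spearman_rho(engine_size, price):.2f}")
```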

**Comparison Experiment for Correlation Coefficients**

Hi. In the previous lectures on feature selection techniques we learnt about the Pearson, Kendall and Spearman’s correlation coefficients. Let’s now test them in different scenarios to see which one provides us the best results. So let’s go to Azure ML Studio and create a small experiment around it. All right, here we are. I have created these four scenarios in parallel. So we have this Enter Data Manually module where I have entered some X and Y rows. Then I have these three Filter Based Feature Selection modules. You can search for them here and you will find them under Feature Selection. I have set up the three correlation coefficients in these three modules here. The first one is Pearson, the second is Spearman, and then the Kendall correlation coefficient. You can set that by selecting the feature scoring method from this drop-down here. The parameter Operate on feature columns only means that if we have specifically assigned one of the columns as the label or predicted variable Y, then it will ignore it. The target column is where we select our predicted variable or Y.

All these scoring methods will try to find out the correlation coefficient of our features in comparison to this variable that we will set. And finally, we have number of desired features that we want to select. In this case I only have one, but in cases where you have large number of features, you can specify a number and the module will select those many features for you. The feature selection module produces two outputs. The first one on this node is the filtered data set. So if you have asked for ten features from, let’s say 20, you will see a data set with ten features as an output.

The second node is where you will see the correlation coefficient values for various features. In the subsequent modules, I am simply merging the results and also adding a column of scoring method names for better understanding. All right, let’s visualize our data set and see what’s in store. Let’s compare X and Y here, and as you will see, they have a direct linear relationship. That’s because Y is twice X in this particular data set. Let me close this. I have already run these experiments, so let’s now visualize the output of all the scoring methods. Well, as you can see, all of them have given us a coefficient of one, so there is no problem using any of them in this situation. However, whenever we are dealing with continuous variables, it’s always better to use Pearson correlation, as the large number of ranks that we may produce can make the other two, rank-based methods less accurate. All right, now let me close this and let’s see the second experiment. In the second one I have made slight changes to the data. Let us visualize the data and see what those changes are.

And here we are. Let’s compare X and Y on a scatter plot, and as you can see, there is no linear relationship here. Y is actually x raised to the power of five, and hence the graph here looks like this. Let me close this and let’s see how the different correlation methods behave. Let’s visualize the output. As you can see, the value of r has gone down in comparison to rho and tau. This is despite the fact that there is a direct relationship between x and y. Now, why has this happened? It has happened because Pearson assumes a linear relationship. So, if we see a nonlinear relationship between x and y, it is safer to use Spearman or Kendall correlation. Now, let’s see which one to use between Spearman and Kendall when we are dealing with ordered data. Let me go to the third experiment and let’s visualize the data set. Well, I have simply changed some of these values so that there is some discrepancy in the relationship.

So, as you can see, the value two appears to be an outlier, though not by a huge margin. And some values here are not following the function. Let me close this and let’s visualize the output. So now Spearman’s coefficient provides us the best correlation compared to Pearson and Kendall. So, when we are dealing with ordered variables and have a few outliers which are not huge by margin, Spearman will provide us the best result. All right, so let’s see the last and the extreme one. Let me visualize this first. As you can see here, there is this outlier for x equal to two. Let’s see what effect an outlier like this will have on our correlation coefficients. So now the Pearson coefficient has actually gone for a toss with the presence of just one outlier.

The Spearman coefficient has also dropped significantly to 0.657, while Kendall provides us the best correlation coefficient for a relationship which we know does exist, except for that one observation, which could be because of a human error or some such mistake. So we can safely say that in the presence of significant outliers, Kendall provides us the best results. Well, Azure ML has given us this ability to compare many modules and interpret our models using visual flowcharts. We should make the best of it by comparing various modules in parallel. I hope that explains to you how and when these correlation scoring methods should be used. Thank you so much for joining me in this lecture and I will see you soon.
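
Three of these scenarios can be reproduced outside the studio with a short Python sketch. The exact rows I entered are not listed in this transcript, so the x values 1 to 10 and the outlier value 1000 are assumptions, and the coefficients printed here will not match the ones on screen. The rank-based functions assume no tied values.

```python
import math
from itertools import combinations


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    co = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return co / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall(x, y):
    signs = [(xj - xi) * (yj - yi) for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)]
    concordant = sum(1 for s in signs if s > 0)
    discordant = sum(1 for s in signs if s < 0)
    return (concordant - discordant) / len(signs)


# ASSUMPTION: x = 1..10 and an outlier of 1000 at x = 2 are illustrative values.
x = list(range(1, 11))
with_outlier = [2 * v for v in x]
with_outlier[1] = 1000

scenarios = {
    "linear": [2 * v for v in x],   # y is twice x
    "power": [v ** 5 for v in x],   # y is x raised to the power of five
    "outlier": with_outlier,        # y is twice x except for one extreme row
}

results = {
    name: {"pearson": pearson(x, y), "spearman": spearman(x, y), "kendall": kendall(x, y)}
    for name, y in scenarios.items()
}
for name, row in results.items():
    print(name, {method: round(value, 3) for method, value in row.items()})
```

With these made-up rows the ordering is the same as in the experiment: all three agree on the linear data, only Pearson drops for the power relationship, and with one extreme outlier Kendall stays the highest.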

**Filter Based Selection – AzureML Experiment**

Hi. In this lecture we are going to see how feature selection works in an experiment. I have this experiment already created, and you have seen all the modules I have used here in our previous lectures. So I’m simply going to walk you through this experiment rather than building it again, to save some time. Here I have this wine quality data set, which is the same as what I had provided earlier during multiclass classification. It has got 13 columns, and we need to ignore the last column as it is derived from the previous data set. I have the Edit Metadata module here. It simply marks quality as the label column or predicted column.

The Select Columns module simply excludes the last column, quality clipped. Then on this side of the experiment we have a Split Data module with a 70% split with stratification. We are using multiclass decision forest with default parameters, and then the usual suspects of Train, Score and Evaluate Model. On this side I have created a parallel branch, but before splitting I have added a filter-based feature selection, while the rest of the steps after it remain the same. I have used Kendall correlation and extracted the seven most influential features. Let’s visualize the feature scores. As you can see, the features are ranked as per their Kendall score, and as we go towards the right, these four features which we are going to discard have a very low score.

Let me close this and let’s visualize the output or filtered data set produced by this module. As expected, it has got only eight columns: seven extracted features and one predicted feature or label column of quality. Let me close this and let’s visualize the Evaluate Model results. As you can see, there is hardly any difference in our result. You may say that the result has actually gone down and you were expecting an improved performance.

Well, your expectations are not wrong. However, this is such a small data set, with only eleven features, that you cannot really expect a huge gain. You may see improved performance when you are dealing with hundreds of features. Moreover, you can see that though we have discarded four features, the result is the same. That means our model has surely taken less time to execute. At the same time, we now have a better understanding and interpretation of why we got this result. We also know which are the topmost factors that influence the outcome. I hope that explains the importance of feature selection and how it is implemented in an experiment. I suggest you perform the same on one of your sample data sets. That’s all I have for this lecture. Thank you so much for joining me in this one.
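
The selection step in this experiment can be sketched in Python as below: score every feature against the label, rank the scores, and keep the top k. This is not the Azure ML module itself; `filter_based_selection` is a hypothetical helper, ranking by absolute score is my assumption, and the three tiny columns are invented rather than taken from the wine quality data set.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant - discordant) / total pairs; tied pairs count as neither."""
    signs = [(xj - xi) * (yj - yi) for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)]
    concordant = sum(1 for s in signs if s > 0)
    discordant = sum(1 for s in signs if s < 0)
    return (concordant - discordant) / len(signs)


def filter_based_selection(
    features: Dict[str, Sequence[float]],
    label: Sequence[float],
    k: int,
) -> Tuple[List[str], Dict[str, float]]:
    """Return the k best-scoring feature names and the full score table."""
    scores = {name: kendall_tau(column, label) for name, column in features.items()}
    # ASSUMPTION: a strong negative correlation is as useful as a positive one.
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:k], scores


# ASSUMPTION: hypothetical rows, far smaller than the real data set.
features = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "volatile_acidity": [0.70, 0.60, 0.88, 0.40, 0.35],
    "pH": [3.5, 3.2, 3.4, 3.3, 3.35],
}
quality = [5, 5, 6, 6, 7]

selected, scores = filter_based_selection(features, quality, k=2)
print("scores:", {name: round(value, 2) for name, value in scores.items()})
print("selected:", selected)
```

The two return values mirror the two outputs of the module: the names of the columns kept in the filtered data set, and the score of every feature.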

**Fisher Based LDA – Intuition**

Hello and welcome to the lecture on feature selection. So far we have seen filter-based feature selection as well as principal component analysis in the previous lectures. Today we will cover Fisher linear discriminant analysis, or Fisher LDA. We have seen that in PCA-based feature selection we plot the data points and select the principal component based on the spread of the data. We do that using the highest eigenvalue. In PCA, we have not considered any class to be predicted. It is a completely unsupervised method of feature selection. We can also see that it is focused on the spread of data among features. Let’s now see how Fisher LDA performs the feature selection and how it is different from PCA. Let’s assume that we want to reduce the dimensions of the loan approval data set and we have the data points plotted for the applicant’s income and the loan amount that she has specified. The main difference while determining the LDA is that we also consider the classification of data.

So what are we trying to predict here? Well, we are trying to find out whether the loan will be approved or not. So we are dealing with a binary classification of predicting the yes or no values for the status. So we apply the same to all the data points and identify the records which fall into each of those categories. When we do that, our plot now looks something like this, where the blue points represent the loan-approved status while the orange points represent loan applications that have not been approved. The process of LDA is very similar to that of PCA. That is, we plot the data points on a new axis in such a way that it can separate the data points belonging to two separate classes. As you can see in this plot, when we represent all the data points along the new axis of LD1, we can clearly separate the two classes with good accuracy. So the data points towards the left side of LD1 represent loan approvals, while as we go towards the right, the data points represent the loans not approved. So how does that happen? How do we draw that axis which can correctly separate the two classes? Before we look into that, let’s take a moment to remember the great British statistician and biologist who used mathematics to combine Mendelian genetics and natural selection, Sir Ronald Fisher. He also developed analysis of variance, or ANOVA, which attempts to express one dependent variable as a linear combination of other features or measurements. Fisher also proposed a formula to determine the LDA axis. Let’s look at the previous example of loan amount versus income.

As the data belongs to two different classes, we first try to find out the mean of each class when plotted on the LDA axis. So we take the mean of the class Yes data and the mean of the class No data. The distance D provides us the variation between these two classes. Next, we look at the variation of data within each of these classes, that is, the variation of data belonging to class Y and class N. If S is the separation between the classes, then Fisher LDA proposes a formula of variation between classes divided by the total variation within the classes. This helps us in getting the best LDA axis using the highest value of separation. That summarizes the intuition or theory behind Fisher LDA. And it also brings us to the end of this lecture on Fisher LDA, which is one of the most popular techniques of feature selection or reducing the dimensions of the data. In the next lecture, let’s apply Fisher LDA on a dataset and reduce the dimensions for the same. Thank you so much for joining me in this one. See you in the next one. Have a great time.
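
The separation formula can be sketched in Python for the two-class case. I am assuming the usual textbook form of Fisher’s criterion, the squared distance between the class means divided by the sum of the within-class scatters, computed after projecting the points onto a candidate axis. The income and loan amount values and the three candidate axes are hypothetical, and a real implementation solves for the best axis directly rather than trying a few by hand.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def project(points: Sequence[Point], axis: Point) -> List[float]:
    """Position of each (income, loan amount) point along the candidate axis."""
    return [px * axis[0] + py * axis[1] for px, py in points]


def fisher_separation(class_yes: Sequence[float], class_no: Sequence[float]) -> float:
    """S = (distance between class means)^2 / (total within-class scatter)."""
    mean_yes = sum(class_yes) / len(class_yes)
    mean_no = sum(class_no) / len(class_no)
    scatter_yes = sum((v - mean_yes) ** 2 for v in class_yes)
    scatter_no = sum((v - mean_no) ** 2 for v in class_no)
    return (mean_yes - mean_no) ** 2 / (scatter_yes + scatter_no)


# ASSUMPTION: made-up (income, loan amount) pairs in arbitrary units.
approved = [(8.0, 2.0), (9.0, 3.0), (10.0, 2.5), (9.5, 3.5)]
rejected = [(3.0, 6.0), (4.0, 7.0), (3.5, 6.5), (5.0, 8.0)]

# Candidate axes: income only, loan amount only, income minus loan amount.
candidate_axes = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]

separation = {
    axis: fisher_separation(project(approved, axis), project(rejected, axis))
    for axis in candidate_axes
}
best_axis = max(separation, key=separation.get)

for axis, s in separation.items():
    print(f"axis {axis}: S = {s:.2f}")
print("best axis:", best_axis)
```

The axis with the highest S keeps the two class means far apart while keeping each class tightly grouped, which is the intuition behind choosing LD1.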

**Fisher Based LDA – Experiment**

Hi, welcome to the lecture on Fisher LDA-based feature selection. In the last lecture we saw how the LDA axis is created between two features. Using the loan approval example, we also saw how the two sides of the axis represent two classes and what we mean by variation between classes and variation within the classes. These two measures are then used to calculate the separation. So now let’s go to Azure ML Studio and implement the same. So here we are. I have this experiment that we ran in one of the previous labs. You can find this data set in the supporting material for this lecture, and before we start working on it, let’s first visualize the data set.

As you can see it has got 1599 rows and 13 columns. We actually don’t need this column quality as we have converted that into wine category. I have simply converted certain ratings below a threshold as low category and some to high category. We are going to ignore this column quality for this experiment and let me close this. The select column module here helps us ignore one of the columns where we have selected all but quality. Next we have usual split data module where we have split it on rows and in 60 40 ratio and stratification on column wine category.

We stratify because we are going to predict three categories of wine. I have used multiclass logistic regression with default parameters for now and then trained it, scored it and evaluated the results. This is pretty straightforward, and now that you have come this far in this course, you should not have a problem understanding these modules.

So as we know, we have used eleven independent features or variables in this experiment and one dependent or predicted variable, wine category. What if we could reduce some of these and see the impact of the same on our model prediction? As all of these variables are numeric, I am going to use Fisher LDA, for which we have seen the intuition in the previous lecture. So let me search for it and drag and drop it here. Connect the output of the Select Columns module to the input of Fisher LDA.

It takes two parameters we need to specify the label column or the predicted column and also specify the number of features to select. So let me launch the column selector and select wine category as the label column and click OK. And let’s reduce the dimension or the number of features from eleven to say seven. And that is all we need to do for this particular module. Now let me run this. It has run successfully and as you can see it produces two outputs. The first one is the transformed data set which is nothing but a data set with seven independent columns and one column of wine category and the second one is the transformation.

This transformation output can be used if you want to apply the same transformation on some other data set with the same schema or data structure. All right, let’s now visualize the output data set. These values are nothing but the values of the new variables on the LDA axes, and hence will not be the same as what we saw in filter-based feature selection. This appears more similar to the PCA output that we saw in the PCA lecture. All right, let me close this and let’s copy and paste the rest of the modules, as we want to compare the two experiments, one without Fisher LDA and this new one with it. So let me make some space and make the right connections here.

I have also connected the output to Evaluate Model so that we can compare the two. So all the connections are now done and we are ready to run it. All right, it has run successfully, so let’s visualize the Evaluate Model output and see the comparison. Great. As you can see, the overall as well as the average accuracy has improved slightly, despite us using only seven features instead of eleven. That’s also because those four features actually didn’t add any value to the prediction model. That means our model has surely taken less time to execute as well. I hope that explains how we can use Fisher LDA in an experiment. Try building this model during your own practice. For now, we have come to the end of this lecture, so see you soon in the next one and enjoy your time.