
DP-100 Microsoft Data Science – Data Processing – Solving Data Processing Challenges Part 2

  1. SMOTE – Create New Synthetic Observations

Hello and welcome to the Azure ML course. In today’s lecture on data processing, we are going to talk about SMOTE, the Synthetic Minority Oversampling Technique. It might sound like strange terminology, and if you read about it on blogs or the Internet, it can appear very intimidating. Let’s first understand the problem and then see how SMOTE addresses it, so that the intuition behind SMOTE becomes clear. By now you must have realized that while dealing with classification problems, the percentage of each class in the total sample plays an important role. However, there are scenarios where you may be dealing with an imbalanced data set. By imbalanced we mean that one class is present only as a small minority. For example, if we are trying to identify fraudulent transactions in a data set, we won’t have them in sufficient numbers. Any good system will not have five or even ten percent frauds.

You may barely have 0.5% frauds, or even less. In such cases, your model may try to fit the majority class and may provide a biased prediction. At the same time, it gives a false sense of accuracy. In this case, even if we classify all transactions as non-fraudulent, we will be 99.5% accurate. But that’s not our objective, and our model does not help despite giving a highly accurate result. Some other examples could be manufacturing defects: with the implementation of Six Sigma and Lean practices, the occurrence of defects has been greatly reduced. Similarly, if you need to identify rare diseases, natural disasters, or even enrollment to premier institutes, which have a very low acceptance rate compared to the applications they receive, you will again be dealing with an imbalanced data set. So how do we deal with such scenarios? One of the ways is to resample the data set, either by decreasing the majority class or by increasing the minority class observations.
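To make the accuracy trap concrete, here is a tiny sketch with made-up numbers matching the fraud example above; it shows that a model which predicts "not fraud" for every transaction still scores 99.5% accuracy while catching no fraud at all:

```python
import numpy as np

# Hypothetical data set: 1,000 transactions, only 5 of them fraudulent (0.5%)
y_true = np.array([1] * 5 + [0] * 995)   # 1 = fraud, 0 = not fraud

# A "model" that ignores the data and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(f"Accuracy: {accuracy:.1%}")       # 99.5%, yet not a single fraud is caught
```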

We can do this resampling randomly. In random under-sampling, we simply select a small subset of the majority class and keep the minority class as it is. So let’s say you have a total of 1,000 observations and ten of them, or 1%, are frauds. We may decide to choose only 90 normal observations. That will make the fraudulent transactions 10% of the total sample we have. It definitely helps us balance the data set. However, the discarded observations could contain valuable information, and such an approach could introduce bias.

Similarly, in random over-sampling, we randomly add more minority observations by copying some or all of them, or by replicating them multiple times. For example, we may increase the fraudulent transactions from ten to 110, so that they now make up 10% of the total observations. There is absolutely no information loss as there was with under-sampling; however, it is very prone to overfitting, as we have simply copied some observations (a quick pandas sketch of both random approaches follows below). That is where SMOTE, the Synthetic Minority Oversampling Technique, helps us: it creates new synthetic observations, or new data points.
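Before moving on to SMOTE, here is a minimal sketch of the two random approaches just described. The data set, column name and sample sizes are made up to mirror the fraud example:

```python
import numpy as np
import pandas as pd

# A made-up imbalanced data set: 1,000 rows, 10 of them frauds
rng = np.random.default_rng(123)
df = pd.DataFrame({"amount": rng.normal(100, 30, 1000),
                   "is_fraud": [1] * 10 + [0] * 990})

minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]

# Random under-sampling: keep only 90 majority rows, so frauds become 10% of the sample
under_sampled = pd.concat([minority, majority.sample(n=90, random_state=123)])

# Random over-sampling: replicate minority rows (with replacement) until there are 110,
# again roughly 10% of the data -- no information loss, but prone to overfitting
over_sampled = pd.concat([majority, minority.sample(n=110, replace=True, random_state=123)])
```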

Now, listen to the following steps very carefully, and do not worry if you do not get them 100% in the first attempt; there is a visualization of all of these steps. First, every data point is plotted, no prizes for guessing that. SMOTE then identifies a minority feature vector and its nearest neighbor, and takes the difference between the two. It multiplies this difference by a random number between zero and one, and identifies a new point on the line segment by adding this scaled difference to the original feature vector. The process is then repeated for the other identified feature vectors. I know that many of you may not have got it yet, so let’s visualize it for better understanding. As the first step, we have the data points plotted in two dimensions. Let’s zoom in to the minority class observations.

Then we identify a feature vector and its nearest neighbors. I’m not going to go into the mathematics of how that is done, as the idea here is to understand what happens next. We calculate the difference between these two points, or feature vectors, and multiply it by a random number between zero and one. We then plot a new data point on the line between them. The feature vector we get for this new point is our synthetic data point. We continue this for all the points and can keep adding synthetic observations along these lines, depending on how many data points we asked SMOTE to create.
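The interpolation step itself is tiny. Here is a simplified, single-neighbor sketch of it in NumPy, with made-up points; it is not the full SMOTE algorithm, just the geometry described above:

```python
import numpy as np

rng = np.random.default_rng(123)

def smote_point(x, neighbor):
    """Create one synthetic observation on the line segment between
    a minority feature vector and one of its nearest neighbors."""
    diff = neighbor - x        # difference between the two feature vectors
    gap = rng.random()         # random number between 0 and 1
    return x + gap * diff      # new point somewhere on the line segment

x = np.array([2.0, 3.0])       # a minority-class observation (made up)
nn = np.array([3.0, 5.0])      # its nearest minority neighbor (made up)
print(smote_point(x, nn))      # a synthetic point somewhere between x and nn
```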

From these five points, we have now created nine new, synthetic observations, taking our total to 14. I hope that clarifies how SMOTE creates new sample observations when we are dealing with an imbalanced data set. In this lecture, we covered what an imbalanced data set is and the challenges associated with it, what resampling is, and its various types, such as random under-sampling and random over-sampling. We also understood the Synthetic Minority Oversampling Technique, or SMOTE, and the intuition behind it. In the next lecture, we are going to see how SMOTE is implemented with an example data set. Until then, enjoy your time, and thank you so much for joining me in this one.

  1. SMOTE – Experiment

Hello and welcome. In the previous lecture we learned what an imbalanced data set is, what resampling is, and how SMOTE works. In this lecture we are going to implement SMOTE using Azure ML Studio. So let’s get started. All right, here we are. I have already uploaded a data set for this purpose called Loan SMOTE, which you can download from the course materials section. Let’s visualize it first. It has 100 rows and 13 columns. Let’s go to the Loan Status column; as you can see, it has eleven observations for No and 89 for Yes. Eleven observations out of 100 is not an extreme minority, but we are simply trying to understand how SMOTE is implemented. Let me go back and bring the SMOTE module onto the canvas. You can search for it and drag and drop it onto the canvas as we have done in the past, and make the right connections.

So let’s look at the parameters this module requires. First is the label column, which in our case is Loan Status. Let me launch the column selector, select Loan Status and click OK. All right. The second parameter is the SMOTE percentage. 100% here means we want an equal number of observations to be added, so if we had eleven minority observations, a SMOTE percentage of 100 will add another eleven. Let’s keep it at 100%. The next parameter is the number of nearest neighbors we want. As we saw in the previous lecture, by increasing the number of nearest neighbors you draw feature values from more cases. So let’s keep it at the default.

Let’s keep the random number seed as 123 and run it. All right, it has run successfully and it’s time to visualize it. As you can see, the total number of rows has now increased to 111 from the original 100, because there were eleven minority observations in the original data set and we applied a SMOTE percentage of 100.
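As a side note for anyone working in Python rather than the Studio, the same idea can be sketched with the imbalanced-learn library. This is only a sketch: the file name is hypothetical, and restricting the features to numeric columns is an assumption, since SMOTE interpolates numeric values and categorical columns would need encoding first.

```python
from collections import Counter

import pandas as pd
from imblearn.over_sampling import SMOTE   # assumes the imbalanced-learn package is installed

loans = pd.read_csv("loan_smote.csv")      # hypothetical file name for the Loan SMOTE data
X = loans.drop(columns=["Loan Status"]).select_dtypes("number")
y = loans["Loan Status"]                   # 89 'Y' rows and 11 'N' rows, as in the lecture

smote = SMOTE(sampling_strategy={"N": 22}, # roughly a "100% SMOTE percentage": double the minority
              k_neighbors=1,               # number of nearest neighbours to interpolate with
              random_state=123)
X_resampled, y_resampled = smote.fit_resample(X, y)

print(Counter(y_resampled))                # expected: Counter({'Y': 89, 'N': 22})
```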

Back in the Studio, the Loan Status column now has 22 rows with No. Let me show you how the data has been transformed and which records have been added. For that, I’m going to pause the video, add a couple more steps to this experiment, run them and come back to you. All right, what I have done here is add a Split Data module with the splitting mode set to regular expression, selecting all rows where Loan Status starts with N. The result data set of the Split Data module is then converted to CSV so that I can download it. Let’s now go to the CSV file and compare our results.
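If you prefer to inspect the synthetic rows outside the Studio, the same filter-and-export step could be sketched in pandas; the file names here are hypothetical:

```python
import pandas as pd

# The SMOTE output downloaded from the experiment (hypothetical file name)
resampled = pd.read_csv("loan_smote_output.csv")

# Keep only the rows whose Loan Status starts with "N", mirroring the regular-expression split
no_rows = resampled[resampled["Loan Status"].str.startswith("N")]
no_rows.to_csv("loan_smote_no_rows.csv", index=False)
```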

All right. I have done some sorting of the records and some color coding to show the comparison. The green rows are the original ones, and as you can see, for the record LP 1047 two new rows have been added. While the values of some of the features are the same, the values of the numeric variables and the property area are different. Similarly, for this other record you see different values across the new observations. If we had to do this on our own, it would have taken much more time, with no guarantee that the result would be the same.

You can use SMOTE for almost any data set with an imbalanced categorical target variable. Once you have created such a data set, you can proceed with the rest of the model-building steps. I hope that has explained what SMOTE is and how we can implement it. In this lecture we used a loan data set and implemented SMOTE with 100% replication and one nearest neighbor, creating eleven additional, synthetic observations from the existing records. That concludes this lecture on SMOTE implementation. Thank you so much for joining me in this one, and I will see you soon in the next class. Until then, enjoy your time.

  1. Data Normalization – Scale and Reduce

Hello and welcome. We have dealt with summary statistics and how to handle outliers in the previous lectures. In this short lecture, we are going to cover the Normalize Data module. In case you are wondering what normalization is and why we need to normalize the data, let’s try to understand that first. Normalization is a method to standardize the range of independent variables or features of the data, as they could be on different scales.

You can use the Normalize Data module to transform a data set so that the columns in the data set are on a common scale. Through normalization the variables are fitted within a certain range, generally between zero and one, and it is applied to numeric columns. So why do we need to normalize? Let’s say you have two variables x1 and x2 and you are trying to predict the values of y, and we get an equation of the form y = a + b1·x1 + b2·x2². In this case, if the values of x1 are 1, 2, 3, 4 and so on, whereas the values of x2 are in the range of thousands, a relatively small change in x2 can lead to huge variations in the value of y.

However, if we normalize or standardize both x1 and x2 onto the same scale, we can overcome this issue. In the case of multivariate analysis, models are usually more stable and the derived coefficients more reliable if we normalize the data. Another reason is that any algorithm that depends heavily on distance computation, such as clustering, which we saw earlier, will be greatly affected if we do not normalize the data. Among the various normalization methods available, these three remain the most popular: Z-score, min-max and logistic. You may want to briefly look at the formulas here, or if you want a detailed explanation, you can visit the Quick Help in Azure ML Studio and see the details of the module. So let’s go to Azure ML Studio and perform normalization on one of the data sets. See you soon in the next lecture.
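As a quick reference before the lab, the three methods mentioned above can be written as simple NumPy expressions. This is only a sketch over a single made-up numeric column x; the Normalize Data module applies its chosen formula to each selected numeric column in a similar way.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])                    # a single numeric column (made up)

z_score  = (x - x.mean()) / x.std()                   # Z-score: mean 0, standard deviation 1
min_max  = (x - x.min()) / (x.max() - x.min())        # Min-max: squeezed into [0, 1]
logistic = 1 / (1 + np.exp(-x))                       # Logistic: sigmoid of the raw value
```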

  1. Data Normalization – Experiment

Hello and welcome. In the previous lecture, we saw what normalization is and why it is important to normalize the data. In this lecture, let’s normalize a couple of columns from one of the data sets. So let’s go to Azure ML Studio and perform the normalization. Welcome to Azure ML Studio; let’s get the employee data set. The employee data set was provided in the previous lectures, and unless you have jumped to this section straight away, you should already have it uploaded. All right, so let’s drag and drop the Normalize Data module and make the right connections with our employee data set. You can choose one of the transformation methods from the drop-down list; I’m going to keep the default. Let’s now launch the column selector and select the columns Monthly Income and Years of Experience.

Click OK and we are good to run it. Let’s now visualize the data set and see how our values have changed. As you can see, the values of both columns are now on the same scale. Let’s plot a scatter plot comparing the two columns, Years of Experience and Monthly Income. As you can see, the shape of the relationship has not changed even though we changed the scale of both columns. This normalized data can then be fed to the workflow downstream for further processing, in a similar manner as we saw earlier. I hope this short lecture has shown you how to apply the Normalize Data module to a data set. This concludes this lecture on data normalization. Thank you so much for joining me in this one, and I will see you in the next lecture. Have a great time. Bye.
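As a side note, the same rescaling can be done in code with scikit-learn if you are working outside the Studio. This is only a sketch: the DataFrame and column names are made up, and min-max scaling is used here, which may differ from the module’s default method.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

employees = pd.DataFrame({
    "MonthlyIncome": [30_000, 45_000, 120_000, 80_000],   # made-up values
    "YearsOfExperience": [2, 5, 15, 9],
})

scaler = MinMaxScaler()   # rescales each column into the range [0, 1]
employees[["MonthlyIncome", "YearsOfExperience"]] = scaler.fit_transform(
    employees[["MonthlyIncome", "YearsOfExperience"]]
)
print(employees)          # both columns now lie on the same 0-1 scale
```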

  1. PCA – What is PCA and Curse of Dimensionality?

Hello and welcome. We are going through some of the practical issues to deal with during the data processing stage, and one such issue is how to deal with a huge number of variables or features in a data set. This is also known as the curse of dimensionality. It refers to the various challenges that arise when analyzing and organizing data in a high-dimensional space. Typically, high-dimensional here means hundreds, sometimes thousands, of variables or features in a data set. As the number of features increases, the data becomes sparse in the multidimensional space, and there is a very high chance of losing accuracy as we keep adding features.

Keeping the sample size the same, the performance of a classifier can actually increase up to a particular optimum point; beyond that point, however, the performance starts decreasing as we add more features. It also requires a longer runtime and can even lead to overfitting. This requires us to reduce the number of dimensions. Does that mean we simply eliminate some of the variables? It could, if we are doing some high-impact feature selection. However, in some cases the features could still be relevant, and we may instead need to reduce the number of dimensions while retaining their information. That is where principal component analysis helps us. It is a heavy name, but don’t worry, we will understand it with a very easy and basic example. So what does principal component analysis do, or what exactly is it? Well, it tries to solve the problem we just described by creating a new set of coordinates for the data. And how does it do that? By revealing the internal structure of the data that best explains the variance in the data, and thus reducing the dimensionality of the multivariate data set. Confused? Do not worry. Let’s visualize it and understand it in much simpler language.

All right. We have this data with two variables or features, x1 and x2. The data is plotted here on the x1 and x2 axes so that we can see the variance across the two. The ultimate objective of PCA is to reduce the dimensions, and there are two easy options here. One, we simply project all the observations onto either x1 or x2; that way we eliminate the other variable. But is that the right thing to do? Or is there another axis that can explain the variance better, an axis that passes through these points in this particular fashion? That brings us to the concept of eigenvectors and eigenvalues. Simply speaking, an eigenvector is a direction and an eigenvalue is a number. When we have this data in two dimensions, we create two eigenvectors perpendicular to each other so that together they span the whole two-dimensional space before we attempt to reduce it.

Hence, we have these two eigenvectors, or directions, which are perpendicular to each other. Let’s call them EV1 and EV2. I’m simply going to rotate the plot so that EV1 and EV2 appear like the two dimensions x and y. As explained earlier, an eigenvalue is nothing but a number that tells us how much variance, or spread, of the data lies along its eigenvector.

As you can see, eigenvalue one explains the maximum variance, and hence we choose eigenvector one as our principal component. Next, we project all of these data points onto this vector, or principal component; so we plot them on PC1, and we now have new values for each data point, as represented by the values along PC1. I hope that explains what principal component analysis is and how it reduces the dimensions of the data. Remember, the new values of the data points are nothing but the values of those points when projected onto the principal component. All right, that brings us to the end of this lecture. In the next lecture, let’s actually reduce the dimensions using PCA. Thank you so much for joining me in this one. I’ll see you soon in the next one. Until then, enjoy your time.
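For anyone who wants to see the mechanics behind the visualization in code, here is a minimal NumPy sketch of the steps described above, using a small made-up two-column data set:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0]])                 # five 2-D observations (made up)

X_centered = X - X.mean(axis=0)                        # centre the data
cov = np.cov(X_centered, rowvar=False)                 # 2x2 covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)        # eigenvalues ascending, vectors in columns
pc1 = eigenvectors[:, np.argmax(eigenvalues)]          # direction with the largest variance

X_pca = X_centered @ pc1                               # project every point onto PC1
print(X_pca)                                           # new one-dimensional coordinates
```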

  1. PCA – Experiment

Hello and welcome. In the previous lecture we learnt about principal component analysis and how it reduces the dimensions of a data set. In this very short lecture, let’s go ahead and implement PCA on one of the data sets. So let’s go to Azure ML Studio. All right, I’m going to bring in the Wine Quality data set, which I hope you have already downloaded from the course materials section in the previous lectures. Next, we search for Principal Component Analysis, and there it is. Let’s drag and drop it onto the canvas and make the right connections. As you can see, it does not require too many parameters, so let’s first select the columns we want to reduce.

So let’s launch the column selector. I’m going to select all the columns except Quality, as that is the dependent variable, and click OK. The next parameter is the number of dimensions to reduce to. We have selected eleven columns for the PCA transformation, so let me reduce those to, let’s say, five dimensions. This data set is actually not an ideal candidate for dimensionality reduction, but we simply want to learn how to use this module for when you are faced with too many dimensions.

So I input five here, and we are going to check the checkbox to normalize the values in the columns to a mean of zero before further processing. We have seen what normalization is in one of the previous lectures. For sparse data sets, this parameter is overridden even if it is selected. We are now good to run it. It has run successfully, so let’s visualize the output. As you can see, the number of dimensions has been reduced to five as requested, apart from the one dependent variable. There is, however, an issue with Azure ML’s PCA module.

It does not provide any further details of how these dimensions have been derived, or of the correlation between the original features and the new dimensions. This output can then be fed to Split Data and the train and test modules for building your own models and evaluating the results. I hope that makes it clear how to use principal component analysis to reduce the dimensions of an existing data set. That concludes this lecture on the PCA lab, and I hope to see you soon in the next lecture. Thank you so much for joining me in this one and have a great time ahead.
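As a side note for anyone working in Python rather than the Studio, scikit-learn’s PCA does expose how much variance each new dimension captures and how it relates to the original features, which partly makes up for the limitation just mentioned. This is only a sketch: the file name is hypothetical, and it assumes the wine-quality data sits in a CSV with a Quality column.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = pd.read_csv("winequality.csv")                  # hypothetical file name for the data set
features = wine.drop(columns=["Quality"])              # the eleven input columns

X_scaled = StandardScaler().fit_transform(features)    # normalise to mean 0, as the checkbox does
pca = PCA(n_components=5, random_state=123)
components = pca.fit_transform(X_scaled)               # five new dimensions per observation

print(pca.explained_variance_ratio_)                   # share of variance captured by each component
print(pca.components_)                                 # how each component relates to the originals
```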

  1. Join Data – Join Multiple Datasets based on common keys

Hello and welcome to this short lecture on data processing. Today we are going to cover how to join data from different data sets. You will not always get all the information in one data set; very often you will be provided information in multiple data sets with common keys. There can be multiple reasons for this: the information might be coming from different sources or might have been created at different times. For example, you may get the historical user information from the CRM database, while the financial or shopping behavior comes from a different system. However, the data sets are linked with a common key such as a user ID, and in such cases we may have to extract the relevant and related records from the two data sets.

The Add Columns module may not be of help here, as we need to extract a subset or superset of the two data sets. Azure ML supports inner join, left outer join, full outer join, and left semi-join. These work just like the corresponding database joins, so if you are already familiar with the concepts, you can jump straight to the next lecture, which demonstrates the Join Data module. All right, let’s go through them in a bit more detail and understand the concepts of joins.

The first one is the inner join. Let’s say you have two data sets that share a common key of employee ID but contain different information: one has financial information such as salary and other financial details, while the other has departments and operational information. We need to combine these data sets in such a manner that we extract the common subset of the two, that is, the observations found in both data sets. As we discussed earlier, they will be related by a common key. In this case, we have four records common across both data sets, where the employee ID is one, three, four and seven. So the inner join of these data sets will create a new data set with these four records and all the columns from both data sets combined.

So our result data set would look something like this. Let’s now see what the full outer join is. In many cases, you might not want to omit a single record, even if that means some rows will have missing information. The outer join combines all the records and appends the values from the additional columns wherever it finds a common key. A full outer join of these two data sets will have all the employees from both data sets. Wherever it finds a common ID, it appends the column values, as we saw in the case of the inner join, and for all the others there will be missing values. So employee IDs one, three, four and seven are found in both data sets.

So you have values for all three columns for these employees. However, employee IDs two and eight do not exist in data set two, hence we will see missing or null values in the department column. The same goes for employee IDs nine and ten, with missing values for salary. All right. The next type of join supported by Azure ML is the left outer join. As the name suggests, it takes all the records from the left, or first, data set and joins the matching records from the second data set. The observations where it could not find a match are left with missing or null values.

So in our case, it will look something like this. All right, the last in the series of data joins is the left semi-join. As the name suggests, it extracts the records from the left, or first, data set where it finds a match with the right, or second, data set. However, in this case it does not add the columns or values from the right data set. So our result data set will look something like this. This can be particularly useful if you want to create a filter based on the records in the second data set. I hope that explains the different types of joins supported by Azure ML and what a join is. In the next lecture, we will see the implementation of these different types of joins in Azure ML Studio. So see you soon and have a great time.
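As a quick reference for readers who know pandas, the four join types map directly onto DataFrame merges. This is only a sketch with made-up employee data mirroring the example above:

```python
import pandas as pd

salaries    = pd.DataFrame({"EmployeeID": [1, 2, 3, 4, 7, 8],
                            "Salary": [50, 60, 55, 70, 65, 52]})
departments = pd.DataFrame({"EmployeeID": [1, 3, 4, 7, 9, 10],
                            "Department": ["HR", "IT", "IT", "Ops", "HR", "IT"]})

inner      = salaries.merge(departments, on="EmployeeID", how="inner")   # IDs 1, 3, 4, 7 only
full_outer = salaries.merge(departments, on="EmployeeID", how="outer")   # all IDs, NaN where missing
left_outer = salaries.merge(departments, on="EmployeeID", how="left")    # every salary row kept

# Left semi-join: keep salary rows that have a match, without adding any columns
left_semi = salaries[salaries["EmployeeID"].isin(departments["EmployeeID"])]
```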

  1. Join Data – Experiment

Hello and welcome. In the previous lecture we learnt about the various types of joins and why we need them. In this very quick lecture, let’s see how we can join data using these different join types. So let’s go to Azure ML Studio and implement them. To save some time, I have already uploaded the two data sets, which contain the same data we saw in the previous lecture.

You can visualize them after uploading them yourself, and you can also download these data sets from the course materials section. Next, there is the Join Data module. You can search for it and then drag and drop it onto the canvas. As you can see, it takes five parameters. Every Join Data module requires two data sets as input and produces one output, or result, data set. Using the column selector, you can specify the key column for the left data set as well as for the right data set. Select this checkbox if you want it to match the case of the column names.

You can choose the join type using this drop-down, which provides the options of inner, left outer, full outer and left semi-join. For a quick explanation, what I have done is replicate the Join Data module with the different join-type options. The common key is employee ID, and we can run the experiment to see the output.

All right, it has run successfully, so let’s visualize the output of the first one. It is as expected: the result data set has the four rows where the employee ID matches in both data sets. You can build your own join data experiments, run them with different join types and visualize the output. That ends this lecture on the Join Data experiment. I suggest you take your own data set and run this Join Data experiment multiple times on various keys. Thank you so much and have a great time ahead.