Practice Exams:

Google Professional Data Engineer – Ops and Security part 1

  1. StackDriver

Here is a question that I’d like us to keep in mind as we go through the material in this video: only GCP projects and resources can be monitored using the GCP Stackdriver monitoring service. Is this statement true or false? We’ll come back to the answer at the end of the video. Hello and welcome to this module, which has to do with operations on the Google Cloud Platform. A lot of what we are going to discuss in this module will have to do with Stackdriver, so let’s start with a quick word of introduction about that company. Stackdriver is a firm, acquired by Google in 2014, that specialized in DevOps products running on top of cloud platforms. A number of Stackdriver services are available to users of the Google Cloud Platform.

For instance, Stackdriver Monitoring, which is maybe the most important, collects metrics and metadata from GCP, AWS and other cloud providers and puts them into dashboards, charts and alerts that you can easily consume. Getting a Stackdriver account is a slightly different, additional step beyond just using GCP. Stackdriver is available in two service tiers, Basic and Premium, and you can start with a 30-day free trial of the Premium service. So let’s plunge in and take a look at how Stackdriver Monitoring actually works. The basic idea is that there is a Stackdriver account which monitors a whole bunch of different projects. Those projects could be either GCP projects or AWS projects.

Stackdriver is also smart enough to interface with Cassandra, Nginx, Apache (that’s the web server), Elasticsearch and many other commonly used apps. From all of these cloud platforms, from common application components, from hosted uptime probes, or from instrumentation that has been inserted into code, Stackdriver will collect metrics, events and other useful information and present them in the form of dashboards, alerts, uptime checks or other easy-to-consume formats. Remember that Stackdriver is now a Google-owned company, and so a certain specific structure needs to be followed if you would like to interface with or monitor AWS projects. Here is a schematic representation.

At the start of all of the action is a Stackdriver account. Next, there is a hosting project, which is highlighted in blue. The Stackdriver account is a GCP resource, and this account therefore needs to reside in some billable project. The project which hosts the Stackdriver account is called the hosting project. So for you as the consumer of the Stackdriver monitoring service, the project where you will be looking at the dashboards and checking out the alert policies and the checks is the hosting project; all of that will be done via it. Now, if all that you need to do is monitor one project, it’s perfectly okay to create the Stackdriver account connected with that project alone.

So you just have one standalone project inside which there is a Stackdriver account which keeps track of all of the resource usage of that project. If, on the other hand, you would like to monitor multiple projects in GCP using one Stackdriver account, create a new Stackdriver account in an otherwise empty hosting project. Do not use the hosting project for any purpose other than hosting the Stackdriver account. This is just a quirk of the way Stackdriver works. Things get even more complex if you would like to interface with or monitor AWS accounts. In that case, what you need to do is create AWS connector projects. Thankfully, this is not something that you need to do manually.

All that you need to do is add an AWS account to your Stackdriver account. Stackdriver Monitoring is smart enough to figure out that you’d like to monitor a project that sits on a different cloud provider, and it will go ahead and create a connector in the form of an AWS connector project. That AWS account presumably already has monitoring and logging agents on its EC2 instances, and those agents will send their metrics and their logs to this connector project, from where they get passed on to your Stackdriver account and to your dashboards. Notice, though, that these updates, these reports on logs and so on, are going to be in the AWS connector project and not in the hosting project of the Stackdriver account.

That’s an important little detail to keep in mind. Another important detail: if you are going to use an AWS connector project, do not place any additional GCP resources in it. Recall that this AWS connector is going to be set up for you by Stackdriver, so it’s going to do a whole bunch of behind-the-scenes stuff. Don’t interfere with those AWS connector projects by adding GCP resources into them; such GCP-only resources will not be monitored. Just make use of the regular monitored projects which are available within GCP and which can easily be monitored by Stackdriver. No issues at all. This is a pretty high-level overview of how Stackdriver Monitoring works.

Let’s now understand the kinds of metrics that this monitoring can collect for you. Stackdriver comes with hundreds of preconfigured metrics: for instance, the CPU utilization of VM instances, the number of tables in SQL databases, and so on. In addition to this plethora of preconfigured standard metrics that Stackdriver calculates for you, you can also create custom metrics and ask Stackdriver Monitoring to track them. Custom metrics can be of three types. Gauge metrics represent instantaneous measurements of some quantity, such as, say, CPU utilization. Delta metrics represent changes between instantaneous measurements, so these need the tracking of current and past measurements.

And finally, cumulative metrics, as their name would suggest, accumulate over a window or some kind of batch of points. Clearly, Stackdriver can potentially collect a lot of data for you. This metric data will be available within Stackdriver Monitoring for six weeks. You might be wondering what the latency of all of these metric measurements is, so let’s quickly talk about that. Really important utilization metrics, such as VM CPU utilization, are updated once a minute, and you can pick these off or monitor them with a three-to-four-minute lag. That’s a pretty good SLA.

You can also write metrics programmatically to a metric time series (a quick code sketch follows below). The first time you do this, it will take some time, a few minutes maybe, for the data to show up, because there is a lot of initial setup required. Subsequent programmatic writes ought to show up in your update stream in a matter of seconds. As already discussed, you can monitor pretty much anything using Stackdriver. Because Stackdriver is a Google-owned company, it’s pretty obvious that you’d be able to monitor pretty much anything on GCP: VM instances, errors, traces on App Engine instances, or container instances. All of that is monitorable. It is also possible to monitor a wealth of stuff on AWS: EC2 instances, RDS databases, and so on. You should know that it’s also possible to monitor Cloud Storage buckets, so it’s not merely compute or consumable services that you can monitor, but just about anything that’s available for use in GCP. You can set up alerts, so you can ask to be notified if certain conditions are met, and there are lots of options for these notifications: every conceivable means of communication, including email, phone, SMS, Slack and HTTP webhooks. One bit to note is that these options depend on your service tier, so the more sophisticated the notification system, the more you’ll have to pay.
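To make the programmatic metric writes mentioned above a little more concrete, here is a minimal sketch using the google-cloud-monitoring Python client library; the project ID, metric name, instance ID and zone are all placeholder values, and the exact constructor style can vary slightly between library versions.

```python
# Minimal sketch: writing one data point of a custom metric time series
# with the google-cloud-monitoring client. All identifiers are placeholders.
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder project ID

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/queue_depth"  # hypothetical custom metric
series.resource.type = "gce_instance"
series.resource.labels["instance_id"] = "1234567890123456789"  # placeholder
series.resource.labels["zone"] = "us-central1-f"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

# The first write to a brand-new custom metric may take a few minutes to appear;
# subsequent writes typically show up within seconds.
client.create_time_series(name=project_name, time_series=[series])
```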

Let’s also understand now how Stackdriver’s Error Reporting works with the different compute options available to us. For instance, in an App Engine standard environment, log entries with a stack trace and with a severity of error or higher will automatically show up in Stackdriver’s Error Reporting. You don’t need to carry out any explicit instrumentation in order to make this happen. Similarly, for a flexible environment, anything which is written to the stderr output stream will automatically show up in your Error Reporting. If you are using a Compute Engine instance, you will need to instrument your code: for instance, you’ll need to set up try/catch blocks, and then in your catch block you will need to write to the error stream using the Stackdriver client.
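For the Compute Engine case, a minimal sketch of that try/catch instrumentation with the google-cloud-error-reporting Python client might look like this; risky_operation is a hypothetical stand-in for your own application code.

```python
# Minimal sketch: reporting a caught exception to Stackdriver Error Reporting
# from code you instrument yourself (for example on a Compute Engine instance).
from google.cloud import error_reporting

client = error_reporting.Client()

def risky_operation():
    # Hypothetical placeholder for real application logic.
    raise ValueError("something went wrong")

try:
    risky_operation()
except Exception:
    # report_exception() captures the current exception and its stack trace
    # and sends it to Error Reporting.
    client.report_exception()
```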

And if you’d like to pick up error information from an Amazon resource such as an EC2 instance, you’ll need to enable Stackdriver Logging and then make use of the AWS connector projects which we’ve previously discussed. Another great bit of functionality that comes along with Stackdriver is Stackdriver Trace. This is a distributed tracing system which will collect latency data from App Engine instances, load balancers and any application that’s been instrumented with the Trace SDK, and then render it in really appealing visual representations. If you have ever used TensorBoard, the visualization aid which works with TensorFlow, the deep learning framework also made available by Google, Stackdriver Trace is very similar in look and feel as well as in the way it works.

Here is a screenshot that shows what Trace looks like. You can see that it’s a pretty sophisticated application. There are three important concepts that you ought to keep in mind while working with Stackdriver Trace. A trace, in the Stackdriver world, refers to an incoming request to your application and the various events, usually RPC calls, which happen in response. Stackdriver Trace will make very precise measurements of the timings of these response events, such as the RPC calls. Each of those events which happen in the response will be represented using spans, that is, time spans. And that brings us to the second concept worth keeping in mind.

A span is a component of a trace. It’s like a step in a timeline which represents one RPC call which took place as a result of the incoming request. And the whole point of Stackdriver Trace is to also capture all of the metadata attached to a particular span: for instance, what version of the service was the span being executed on, and what input parameters were passed into those RPC calls? Information like this is contained in annotations, and annotations are the third important concept that Stackdriver Trace concerns itself with. In this way, Stackdriver Trace can be used for answering questions like: how long does it take for your application to handle incoming requests, either from users or from other applications?

How long does it take to complete specific operations, the specific RPC calls which are performed in response to those requests? What is the round-trip time for calls to App Engine services like Datastore, URL Fetch or Memcache? All of these are very standard use cases for Stackdriver Trace. Let’s come back to the question we posed at the start of this video. The statement is false. Remember that Stackdriver was a cloud DevOps service which was set up independently and acquired by Google only a few years into its existence, and that’s why Stackdriver is indeed able to monitor other cloud providers’ products as well. For instance, you could easily add an AWS resource such as an EC2 instance or an RDS database and monitor it using Stackdriver Monitoring.
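Before we move on, here is a small sketch that ties the trace, span and annotation vocabulary to code. It uses the OpenCensus Stackdriver trace exporter, which is one common way to instrument Python code for Stackdriver Trace; the project ID, span names and annotation values are made up for the example.

```python
# Minimal sketch: creating spans with OpenCensus and exporting them to
# Stackdriver Trace. Span names and annotation values are illustrative only.
from opencensus.ext.stackdriver import trace_exporter
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.tracer import Tracer

exporter = trace_exporter.StackdriverExporter(project_id="my-project")  # placeholder
tracer = Tracer(exporter=exporter, sampler=AlwaysOnSampler())

# One incoming request becomes a trace; each step taken to handle it is a span.
with tracer.span(name="handle_request"):
    with tracer.span(name="datastore_lookup") as rpc_span:
        # Annotations attach metadata, such as input parameters, to a span.
        rpc_span.add_annotation("issuing lookup", entity_kind="User", key="user-123")
    with tracer.span(name="render_response"):
        pass  # placeholder for response-rendering work
```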

  1. StackDriver Logging

Here is a question that I’d like us to keep in mind as we go through the contents of this video: how are audit logs and data access logs different from each other? Closely related to Stackdriver Monitoring is another Stackdriver service: Stackdriver Logging. Stackdriver Logging is a service which includes storage for logs, a user interface for log viewing, and an API to manage logs programmatically. Stackdriver allows us to read and write log entries. A log entry records a status or an event, and it could be created by GCP services, AWS services, third-party apps, or your own apps.

The message carried by a log entry is called the payload, and it could be as simple as a regular string, or it could even be complex structured data. Your project is likely to get a lot of log entries from common services such as Compute Engine or BigQuery. You will also get log entries if you connect Stackdriver to AWS or if you install the Stackdriver Logging agent on your VM instances. You can also explicitly write log entries by using the Stackdriver Logging API; a quick sketch of this follows below. Log entries will be retained within Stackdriver Logging for a limited retention period and deleted after that. This retention period is typically less than a month, depending on your service tier.
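Here is that sketch of explicit writes, using the google-cloud-logging Python client; the log name and payload contents are just examples.

```python
# Minimal sketch: writing a simple text entry and a structured entry
# with the google-cloud-logging client library.
from google.cloud import logging

client = logging.Client()
logger = client.logger("my-app-log")  # hypothetical log name

# A simple string payload.
logger.log_text("Order processing started", severity="INFO")

# A structured payload: an arbitrary JSON-like dictionary.
logger.log_struct(
    {"event": "order_failed", "order_id": "A-1001", "retries": 3},
    severity="ERROR",
)
```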

If you’d like to hold on to your log entries for longer, you will need to export them into some outside sink. You can search and filter logs for useful information using pretty complex predicates: Stackdriver actually has something called the Stackdriver Logging filter language. This is a pretty simple language, but it allows you to create advanced log filters, which can be used in the Logs Viewer. You can also use the Stackdriver Logging API to select log entries at very low levels of granularity. For instance, you can say that you are only interested in log entries from a particular VM instance, or those that arrived in a specific time span and have a certain severity level.
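To give a flavour of such filters, here is a hedged sketch that selects entries from one VM instance, at error severity or above, within a time span; the instance ID and timestamp are placeholders, and the filter syntax is the advanced log filter language just mentioned.

```python
# Minimal sketch: selecting log entries with an advanced log filter.
from google.cloud import logging

client = logging.Client()

log_filter = (
    'resource.type="gce_instance" '
    'AND resource.labels.instance_id="1234567890123456789" '  # placeholder instance
    'AND severity>=ERROR '
    'AND timestamp>="2020-01-01T00:00:00Z"'                   # placeholder start time
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.severity, entry.payload)
```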

As discussed, log entries will only be retained within Stackdriver for a pretty short length of time, a month or less, so you’ll need to export them to carry out any meaningful long-term analysis. Log entries from Stackdriver Logging can be exported to a variety of sinks, including Cloud Storage buckets, BigQuery datasets, and Pub/Sub topics. Exporting logs involves configuring log sinks, which will then go ahead and actually export the log entries as they arrive. A sink includes a destination as well as a filter which picks which log entries are to be exported. And lastly, you can create log-based metrics, which you can then keep tabs on using Stackdriver Monitoring.
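A sink like that can also be configured programmatically; here is a minimal sketch that exports entries of warning severity and above to a BigQuery dataset, with placeholder project and dataset names.

```python
# Minimal sketch: creating a log sink that exports WARNING-and-above entries
# to a BigQuery dataset. Project and dataset names are placeholders.
from google.cloud import logging

client = logging.Client()

sink = client.sink(
    "warnings-to-bigquery",
    filter_="severity>=WARNING",
    destination="bigquery.googleapis.com/projects/my-project/datasets/exported_logs",
)

if not sink.exists():
    sink.create()  # the sink then exports matching entries as they arrive
```

Note that the sink’s writer identity also needs write access to the destination dataset before exported entries will actually land there.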

In addition to the log messages that you or your applications explicitly write for performance monitoring, there also are a whole bunch of audit logs. These are permanent GCP logs: there’s no retention period, and they are maintained for various tracking purposes. Some, but not all, of these are freely viewable by everyone. There are two primary types of audit logs: Admin Activity logs and Data Access logs. Admin Activity logs are always enabled; you don’t need to do anything to start the collection of Admin Activity log information. Admin Activity logs include all administrative actions which modify the configuration or metadata of resources.

Let’s say, for instance, that VM instances or App Engine applications are created, or their permissions are changed; log records for these will automatically be created within the Admin Activity logs. These can then be viewed by anyone with the appropriate Identity and Access Management role, such as the Logs Viewer or even the Project Viewer role. Admin Activity logs are an important part of answering questions like who did what, where and when within a GCP project. The other major type of audit log is the Data Access log. The primary function of Data Access logs is to audit who accessed data, as the name would suggest, so they log API calls that create, modify or read user-provided data.

These are usually disabled by default; an exception is BigQuery, where data access logs are always on. In general, data access logs are not enabled because they can become quite large. If you want to turn these on, you can do so, but your project will be charged for the additional logs usage. This is different from Admin Activity logs, which are always on and not charged. Some data access audit logs can be marked as private because they contain sensitive personally identifiable information; to read these, you will need special permissions, or you’ll need to be an owner of the project containing them.

Stackdriver Logging, just like Stackdriver Monitoring, needs to be associated with a Stackdriver account, and Stackdriver accounts are assigned service tiers. The basic level of logging functionality, which you can get without a Stackdriver account, is free, but it has a 5 GB cap. Beyond that, you’ll need an account, and the retention period and the types of options that you get for both logging and monitoring will depend on the service tier. As we’ve already discussed, you can monitor pretty much anything using Stackdriver Logging: VM instances, AWS EC2 instances, database instances, your own custom apps, just about anything. Recall that your log data is only held within Stackdriver Logging for a short retention period; to use it after that, you’ll need to export it, and you can export it to pretty much any tool of your choice.

BigQuery and Pub/Sub topics jump to mind. And also keep in mind the close link between Stackdriver Logging and Stackdriver Monitoring. We’ve discussed how metrics can be defined to be tracked and visualized in Stackdriver Monitoring, and we saw how those metrics could be gauge metrics, delta metrics or cumulative metrics. All of these can be created from Stackdriver logs, so you can take data from Stackdriver Logging, set up metrics on it and visualize them using Stackdriver Monitoring. Let’s revisit the question we posed at the start of the video. Admin Activity audit logs are enabled by default: you don’t have to do anything explicitly in order for these to start, and you are not billed for them.

Data access logs are a type of audit log, so data access logs are a subset of audit logs. Data access logs record data accessed by users, as their name would suggest. With the exception of BigQuery, in most tools and applications, data access logging is not turned on by default. Now, if data access logs are a subset of audit logs, what are the other types of audit logs? The answer is admin activity logs. Admin Activity logs are also audit logs, and these are always on by default. They log any actions which modify the configuration or metadata of a resource. So, data access logs need to be explicitly turned on, with the exception of BigQuery, and Admin Activity logs are always on.
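For reference, the two kinds of audit logs appear in Stackdriver Logging under distinct log names, so they can be read back with the same client we sketched earlier; "my-project" below is a placeholder project ID.

```python
# Minimal sketch: reading Admin Activity and Data Access audit log entries.
from google.cloud import logging

client = logging.Client()

admin_activity_filter = (
    'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"'
)
data_access_filter = (
    'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Fdata_access"'
)

# Admin Activity entries: who changed which configuration, and when.
for entry in client.list_entries(filter_=admin_activity_filter):
    print(entry.timestamp, entry.payload)

# Data Access entries only exist where data access logging is enabled
# (BigQuery has it on by default).
for entry in client.list_entries(filter_=data_access_filter):
    print(entry.timestamp, entry.payload)
```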

  1. Lab: Stackdriver Resource Monitoring

This is a lab to get you familiar with monitoring resources using Stackdriver. We will be spinning up some VM instances to monitor, following which we will enable Stackdriver and then take a look at some charts and dashboards. We will also go on to create alerts, resource groups and some uptime checks. To begin, though, let us bring up the Google Cloud Shell, and once the shell is up, we will execute some commands in order to provision some VM instances which run Nginx. Though you can see the command on your screen, it is also included in a file which is attached to this video. Once this command is executed, we will have exactly three instances which have been created from an Nginx image.

Once that is ready, let us go ahead and create a firewall rule which will allow external traffic to these instances. This is the command to execute for that; again, it is also attached to the video. And now we have our instances and our firewall rule in place as well. The next step for us is to go on and create a Stackdriver account. So we navigate in the menu to Monitoring, and once we have logged in as a user, we will be given the option to create a Stackdriver account. When we get the option to connect a project to our account, we just choose to create a new account. We can add additional projects to the Stackdriver account, but let us skip this for now.

We can also link an AWS account for monitoring to the Stackdriver account, but let us skip that as well. This page tells us how to install Stackdriver agents on some resources if we want to monitor them. And over here, let us just say we do not want to get reports by email. Finally, we just wait for Stackdriver to gather a bunch of metrics, and when it’s ready, hit Launch Monitoring. We will need to specify that we wish to continue with the Stackdriver trial, but once we do that, let us go ahead and create our first dashboard. So we hit Create Dashboard, give this one a name, Archinfer Dashboard, and then let us go and add a new chart so we can specify the metric which we wish to monitor.

So this is going to be the network outbound traffic, and the resource type is going to be our instances. Once the chart loads, we can see the outbound traffic from each of our instances. Moving along to the advanced options, we can do things here like set a threshold line, so that we can visually see if any of our resources have exceeded a certain threshold for the metric we wish to monitor. Rather than have individual lines for each of our resources, we can also choose to aggregate their values, and we can also apply some filters. So let us go ahead and see what our options are. As we can see, we can filter our output by name and resource ID, but let us just pick one and save this chart.

And once it is done, let us just go and take a look. So our graph has loaded and we can see each of our instances here. Moving along, though: in case we do not want to create a dashboard and we just want to view the metrics for our resources, we have something called the Metrics Explorer. So let us navigate to Resources and then Metrics Explorer. Over here, we can pick the specific resource and the metric which we wish to analyze. We browse through the list of available metrics and pick CPU utilization to analyze across our instances. As you can see, just like in dashboards, we can apply some aggregations and filters over here as well.
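The same numbers the Metrics Explorer shows can also be pulled through the Monitoring API; here is a hedged sketch that reads CPU utilization for the last 20 minutes, with a placeholder project ID.

```python
# Minimal sketch: querying CPU utilization time series for the last 20 minutes.
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 20 * 60},
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type="compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    print(series.resource.labels["instance_id"], "->", len(series.points), "points")
```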

Let us move along now and create an alerting policy. So we navigate in the menu to Alerting, create a new policy, and now specify the conditions under which we are alerted. For our first condition, we specify a metric threshold. The resource this condition applies to is an instance type, and it applies to a single Nginx instance. The metric which we wish to monitor is CPU usage, and the condition under which we are alerted is if the usage is above 20%, and if this goes on for, let’s just say, 1 minute. Now that we have saved this condition, let us go ahead and add a second one. This condition is going to be somewhat similar to the one we just created.

So we add another condition, also a metric threshold, and in this case choose to apply it to our second Nginx instance. The metric we will monitor is once again CPU utilization; this time, though, we keep the threshold at above 10% for 1 minute. Once we save this condition, we have two separate conditions in our alerting policy. Let us now pick a policy trigger: we would like to be notified when both these conditions are met, and we would like to be notified by email, so we add that notification and specify an email address. Finally, let us just give our alerting policy a name. Once we have done this, we just hit Save Policy, and we have now created our first alerting policy.
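As an aside, an alerting policy like this can also be created through the Monitoring API rather than the console. The sketch below is hedged: it uses placeholder names, sets up just one CPU condition rather than the two we clicked together, and omits the notification channel, which would be created and attached separately.

```python
# Minimal sketch: an alerting policy with a single metric-threshold condition
# (CPU utilization above 20% for 1 minute). All display names are placeholders.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="CPU above 20% for 1 minute",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="compute.googleapis.com/instance/cpu/utilization" '
            'AND resource.type="gce_instance"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.20,
        duration=duration_pb2.Duration(seconds=60),
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="nginx-cpu-alert",  # placeholder policy name
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[condition],
    # notification_channels=[...]   # email / SMS / Slack channels attached here
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print("Created policy:", created.name)
```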

Let us move along now and create a resource group. For that, we navigate in the menu to Groups and choose to create a new group; call this one GCE Central. In the filter, we will specify the criterion that we want this group to pick up all resources in the US Central region, hence the name GCE Central. Once this is done, we hit Save Group. The instances which were initially provisioned were all in the US Central region, so let’s see if these have been picked up. And when we go to the dashboard, yes, we see all the Nginx instances over here.

One thing we can do when we have a resource group is to set up uptime monitoring. So let us go ahead and add an uptime check: we select to add a new check, and we wish to check instances in this group on the HTTP port, just to see if they are listening for HTTP connections. We pretty much leave most of the options as is, just specify a name and hit Save. We will be asked if we want to set up an alerting policy as well, but let us just decline this offer, and now we have set up uptime monitoring too. With that, we have come to the end of this lab on resource monitoring using Stackdriver.
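As a quick addendum, an HTTP uptime check can also be created through the Monitoring API. The sketch below is hedged: it targets a single placeholder host rather than the resource group we selected in the console, and all names and addresses are placeholders.

```python
# Minimal sketch: an HTTP uptime check against a single host on port 80.
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

config = monitoring_v3.UptimeCheckConfig()
config.display_name = "nginx-http-uptime"  # placeholder check name
config.monitored_resource = {
    "type": "uptime_url",
    "labels": {"host": "203.0.113.10"},  # placeholder external IP
}
config.http_check = {"path": "/", "port": 80}
config.period = {"seconds": 60}    # how often the probe runs
config.timeout = {"seconds": 10}   # how long each probe waits

client = monitoring_v3.UptimeCheckServiceClient()
created = client.create_uptime_check_config(
    request={"parent": f"projects/{PROJECT_ID}", "uptime_check_config": config}
)
print("Created uptime check:", created.name)
```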