
Amazon AWS Certified Data Analytics Specialty – Domain 2: Storage

  1. DynamoDB Security

So now let’s talk about DynamoDB security. Basically, you have VPC endpoints to access DynamoDB without going through the Internet. Access is fully controlled by IAM. You get encryption at rest using KMS and encryption in transit using SSL/TLS. So it’s basically as secure as any AWS service. You also get a backup and restore feature, so you can do a point-in-time restore just like RDS, and there’s no performance impact when we do backups and restores. You can define global tables, which are multi-region, fully replicated, high-performance tables.
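To make the backup side of that concrete, here is a minimal boto3 sketch, assuming a hypothetical table named "MyTable" that already exists; it simply turns on point-in-time recovery and takes an on-demand backup.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Enable point-in-time recovery (continuous backups) for a hypothetical table.
# There is no performance impact on the table itself.
dynamodb.update_continuous_backups(
    TableName="MyTable",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Take an on-demand backup as well.
dynamodb.create_backup(TableName="MyTable", BackupName="MyTable-backup-1")
```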

And you can use DMS, the Database Migration Service we saw before, to migrate data to DynamoDB, for example from MongoDB, Oracle, MySQL, S3, et cetera. We can also launch a local DynamoDB database on our computer if we need to do development or just test a few things out; a minimal connection sketch is shown below. Okay, so that’s it for DynamoDB. I hope that was helpful and a well-rounded introduction to it, and it should be enough for your big data exam. All right, I will see you in the next lecture.
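Quick aside, as promised: a minimal sketch of pointing boto3 at DynamoDB Local for development. The endpoint and dummy credentials below are assumptions; DynamoDB Local listens on port 8000 by default.

```python
import boto3

# Point boto3 at a locally running DynamoDB Local instance instead of AWS.
local_dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="us-east-1",
    aws_access_key_id="fake",       # DynamoDB Local ignores real credentials
    aws_secret_access_key="fake",
)

# List whatever tables exist in the local instance.
print(list(local_dynamodb.tables.all()))
```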

  1. DynamoDB: Storing Large Objects

So let’s talk about a classic solution architecture with DynamoDB, which is how to store large objects. You should know by now that the maximum size of an item in DynamoDB is 400 KB. That means that if you have a large object, it could be a file, it could be whatever you want, you cannot store it in DynamoDB directly. So what you have to do for large objects is to store them in Amazon S3 and then add a reference to them in DynamoDB. How does that work? Here our client, and this is something we have to code, obviously, will upload the large object into Amazon S3 wherever it wants to upload it, and then it will create a new row in DynamoDB indicating the item has been uploaded. So you may be asking me what this row looks like.

It could look like this, with an ID, a file name, an S3 URL, and a file size in megabytes, for example, if you want to add some metadata alongside the data in DynamoDB. Then the client, to retrieve that content, needs to read the row from DynamoDB, look up the S3 URL from the row and say, okay, I need to get that object from S3, and then download the large object. The idea is that we can have large objects of gigabytes, if we wanted to, in Amazon S3 and have them referenced in DynamoDB. That means we efficiently use DynamoDB for what it’s good at, and we use Amazon S3 for what it’s great at as well.
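Here is a minimal sketch of that pattern. The bucket name ("my-large-objects"), table name ("LargeObjectIndex"), and attribute names are all hypothetical; the point is just the two-step flow of uploading to S3 and writing a small reference row to DynamoDB.

```python
import uuid
from decimal import Decimal

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

BUCKET = "my-large-objects"                  # hypothetical bucket name
table = dynamodb.Table("LargeObjectIndex")   # hypothetical table name


def upload_large_object(file_path, file_name, size_mb):
    # 1. Upload the large object itself to S3.
    s3.upload_file(file_path, BUCKET, file_name)
    s3_url = f"s3://{BUCKET}/{file_name}"
    # 2. Store a small reference row (metadata only) in DynamoDB.
    item_id = str(uuid.uuid4())
    table.put_item(Item={
        "id": item_id,
        "file_name": file_name,
        "s3_url": s3_url,
        "file_size_mb": Decimal(str(size_mb)),
    })
    return item_id


def download_large_object(item_id, destination_path):
    # 1. Read the reference row from DynamoDB.
    row = table.get_item(Key={"id": item_id})["Item"]
    # 2. Follow the S3 URL and download the actual object.
    key = row["s3_url"].replace(f"s3://{BUCKET}/", "")
    s3.download_file(BUCKET, key, destination_path)
    return row
```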

What I want to show you is that even if you have an object of 300 KB, if that object is rarely accessed, it’s going to be a better architecture to use the one on the left-hand side than the one on the right-hand side. For 300 KB, we still have the option to store the object entirely in DynamoDB, because the maximum size is 400 KB. But let’s assume this object is not accessed very often. So here is a cost comparison I’ve done, and you can look up the numbers later. For Amazon S3, it’s going to cost you this much for storage, this much for putting the object into Amazon S3, and this much per GET for getting the object out of Amazon S3. And then for DynamoDB, because the object is not in DynamoDB but in S3, you only have a very small row in DynamoDB.

So it’s going to be less than 1 KB of storage, which means very little cost for WCU and RCU and also for storage. If you assume one write and 100 reads per month, this is going to cost you a tiny fraction of a cent per month. These numbers are very, very low because I’m talking about a single item of 300 KB; obviously it adds up to a lot more once you have thousands of 300 KB objects. Okay, next, if we store everything in DynamoDB, then the numbers look like this: we have to provision 300 write capacity units just to be able to write that object into DynamoDB, and another 38 read capacity units just to read that object. Then you have the storage cost on top.
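To see where those 300 and 38 come from, here is the back-of-the-envelope arithmetic, assuming one write and one eventually consistent read per second of the 300 KB item: 1 WCU covers a 1 KB write per second, and 1 RCU covers two eventually consistent 4 KB reads per second.

```python
import math

ITEM_SIZE_KB = 300

# Writes: 1 WCU = one 1 KB write per second
wcu_needed = math.ceil(ITEM_SIZE_KB / 1)

# Eventually consistent reads: 1 RCU = two 4 KB reads per second
rcu_needed = math.ceil(math.ceil(ITEM_SIZE_KB / 4) / 2)

print(wcu_needed, rcu_needed)  # 300 38
```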

Then if you assume one write and 100 reads per month, I’ll give you the results: the storage is going to be about eleven times more expensive, and the WCU and RCU are going to be dramatically more expensive. And why? Because they’re underused. So the general wisdom is that even for items that fit in DynamoDB, if you underuse these items, I mean if you don’t read them very often, then S3 plus DynamoDB is a better solution. So what’s the takeaway here? If we have items that are read only rarely and we want to store them somewhere, Amazon S3 is going to be a better data store, with a reference in DynamoDB. But if you have an item that needs to be read very, very often and it’s a small item, then DynamoDB is going to be a better answer than S3 in this case. Okay? So I hope that was helpful and I will see you in the next lecture.

  1. [Exercise] DynamoDB

So for our next hands-on activity, we’re going to continue to build out our order history app. This time, we’re going to introduce DynamoDB into the mix and illustrate some of the stuff that we’ve learned so far. For now, we’re not going to use AWS Lambda because we haven’t covered that yet. Instead, we’re going to use a custom consumer script that will just sit on our EC2 instance, listening for new data on the data stream that we created earlier and funneling that into an Amazon DynamoDB table. Later on, of course, when we cover Lambda, we’ll fill out that piece as well. We’re just getting there one step at a time. So for now, let’s build our Amazon DynamoDB table and create a consumer script to fill the gap between the data stream and DynamoDB. The first thing we need to do is go to our AWS Management Console (I’ve already logged into mine here) and select DynamoDB; type it in here if you don’t see it in your list. And we’re going to create a table.

We’re going to call it CadabraOrders. Again, pay attention to capitalization and spelling. Our partition key is going to be CustomerID, spelled just like that, and that is a number. The reason is that our order application is meant for an individual customer to look up his or her order history, so it makes sense to partition our data by that customer ID, so DynamoDB can very quickly retrieve the information for a given customer as a whole. We’ll also add a sort key, because CustomerID isn’t unique enough for us, right? There could be multiple records associated with a given customer ID representing individual items that they’ve ordered, so the partition key by itself is not sufficient to provide a unique key. So we’ll add an OrderID as well that represents an actual line-item order ID. We’re going to have to fabricate that, as you’ll see shortly, and it will remain a string.
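If you’d rather script it than click through the console, here is roughly the boto3 equivalent of what we just did. The table and key names match the lecture; the provisioned throughput values are just free-tier-friendly assumptions.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="CadabraOrders",
    KeySchema=[
        {"AttributeName": "CustomerID", "KeyType": "HASH"},   # partition key
        {"AttributeName": "OrderID", "KeyType": "RANGE"},     # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "CustomerID", "AttributeType": "N"},  # number
        {"AttributeName": "OrderID", "AttributeType": "S"},     # string
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```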

We can keep all the provisioning set to the default settings. We are well within the free tier of usage here for DynamoDB; you are allowed way more data and way more capacity than we’re going to use right now. So as long as your account is relatively new, don’t worry about being billed for this. Hit Create, and we just have to wait for that table to be created.

Meanwhile, let’s set up our consumer script. Let’s log into our EC2 instance, however you might do that on your system; I will log in as ec2-user. All right, the first thing we need to do is install the boto3 Python library. This is a library that AWS provides that makes it easy to write Python code that communicates with AWS services. To do that, just type in sudo pip install boto3, just like that. I’ve already installed it on my system, but on yours it will do something more interesting and actually install something. We also need to create some credentials files. This is so boto3 knows how to log into AWS using your account credentials, and also what region it will be in. The first step is to create a .aws folder in your home directory. Let’s make sure we’re there first. So let’s go ahead and say mkdir .aws. We’ll cd into it and create a file called credentials in there, so nano credentials. In here we’re going to type in [default], then aws_access_key_id = whatever your access key is for the IAM user that you created earlier in the course. If you don’t have that handy, you can copy it out of the credentials that you put in the agent’s JSON configuration file for Kinesis.

Or you can always create a new IAM user for this, worst case. I squirreled away a copy of mine here, so I’m just going to paste that in with a right-click. The next line will be your secret key: aws_secret_access_key = whatever your secret access key is. And again, don’t get any cute ideas about using my credentials, because they are not going to exist by the time you see this video. Go ahead and hit Ctrl+O to write that out, Enter, and Ctrl+X. We also need a file called config in here, so let’s say nano config.

Again, we’ll do [default], and we will type in region = whatever region you’re setting up your services in; for me, that is us-east-1. All right, Ctrl+O, Enter, Ctrl+X.
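For reference, the two files we just created should end up looking something like this; the key values are placeholders and your region may differ.

```
# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

# ~/.aws/config
[default]
region = us-east-1
```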

All right, let’s cd back up to our home directory. There we are. And let’s download our actual consumer script: wget http://media.sundog-soft.com/AWSBigData/Consumer.py. Pay close attention to spelling and capitalization; they both matter. All right, so our script got downloaded successfully. Let’s take a quick look at it and see what it does: nano Consumer.py. You can see we’re importing that boto3 library that we just installed, and we are creating a client for Kinesis that is connected to the CadabraOrders Kinesis stream that we created earlier in the course, and a client for DynamoDB that’s tied to the new CadabraOrders table that we also just created. Then we just sit in an endless loop until we actually break out of this application, waiting for new records to come in from Kinesis on our CadabraOrders stream. If it finds a new record, it parses out all the information. As you might recall, our Kinesis Agent actually converted all of our source data from CSV to JSON for us, so we just have to parse out that JSON data here.

It will print out each record just so that we can see its progress, and it extracts the data (invoice, customer ID, order date, et cetera) based on the fields in the JSON record that it received from the Kinesis stream. We also construct a unique sort key for this line item. There’s no actual order ID that’s unique to each line item in this data set, so we have to make one up, and to do that, I’m going to fabricate one by just concatenating the invoice number and the stock code.

The idea is that you’re not going to have more than one line for a single stock code on a given invoice, so this should suffice as a unique sort key for every item of data here. After that, we call table.put_item to actually insert this data into DynamoDB, and that’s pretty much it. It just sits there in an endless loop, waiting one second between iterations so we don’t hit any capacity issues or timeouts. Pretty straightforward stuff. Right, so let’s go ahead and hit Ctrl+X to get out of this and see if it works.
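Here is a condensed sketch of what a consumer like this looks like. The stream and table names match the lecture, but the JSON field names, the shard handling, and the type conversions are simplified assumptions rather than the exact downloaded script.

```python
import json
import time
from decimal import Decimal

import boto3

kinesis = boto3.client("kinesis")
table = boto3.resource("dynamodb").Table("CadabraOrders")

STREAM = "CadabraOrders"

# Grab an iterator for the first shard (the real script wraps this logic for you).
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        order = json.loads(record["Data"])  # the agent already converted CSV to JSON
        print(order)
        # Fabricate a unique sort key from invoice number + stock code.
        order_id = f"{order['InvoiceNo']}-{order['StockCode']}"
        table.put_item(Item={
            "CustomerID": int(order["CustomerID"]),
            "OrderID": order_id,
            "Quantity": int(order["Quantity"]),
            "UnitPrice": Decimal(str(order["UnitPrice"])),
            "Description": order["Description"],
            "Country": order["Country"],
            "InvoiceDate": order["InvoiceDate"],
        })
    iterator = response["NextShardIterator"]
    time.sleep(1)  # wait a second between polls to avoid capacity issues
```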

First, we have to make this script executable, so chmod a+x Consumer.py, and now we can just run it: ./Consumer.py. Now it’s just sitting there, waiting for new data on my Kinesis stream. So let’s give it some data to play with and see what happens, shall we? I’m going to right-click on PuTTY and open up another terminal window to my EC2 instance. We’ll put them side by side here so you can see them both together. We’ll log in again as ec2-user, and let’s go ahead and run sudo ./LogGenerator.py 10. That will insert ten rows of new data into our log files, and the Kinesis Agent should pick that up; in turn, our consumer script should pick that up and put the rows into DynamoDB. So let’s kick that off and see what happens.

It’ll take about a minute for the agent to catch up with things, notice that the new data exists, and process it, so let’s just be patient for a bit and see what happens. We’re watching the leftmost window here to see if the consumer script actually picks up that data and does something with it. All right, looks like something happened there. That’s pretty cool. So apparently our script did pick up that data, and I don’t see any errors, so that’s a good sign. Right, so let’s check out our DynamoDB table and see if we have any shiny new data in there. Let’s go back to our AWS console; we’re still on the screen for our CadabraOrders table. If we click on Items, there it is. That’s pretty cool. So there are ten rows of data, right? Yeah, looks about right, and it looks like it all came through in one piece. There’s our primary key of CustomerID coming in, and our sort key of OrderID; together, these provide a unique key for each row.

And it looks like the rest of the data came in successfully: we have valid-looking countries, descriptions, order dates, quantities and unit prices. So cool, it all worked. We have actually created a system that works end to end, where we monitor new information being output into a log directory on an EC2 host, a Kinesis stream picks that data up, and a consumer script running on another EC2 instance (or in this case, the same one) turns around and inserts it into DynamoDB. And you can imagine a mobile app that could talk directly to this DynamoDB table to return order information for a specific customer very quickly; a minimal sketch of such a lookup follows below. So we actually have something here that works end to end. Like I said, later on we’ll come back and replace that consumer script with an actual Lambda function, which will be even more scalable. But congratulations, you’ve built something real here. That’s pretty cool.
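Just to illustrate that last point, here is a minimal sketch of how an app might pull one customer’s order history back out of the table; the customer ID value is made up.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("CadabraOrders")

# Fetch every order line for a single (hypothetical) customer ID.
response = table.query(KeyConditionExpression=Key("CustomerID").eq(12345))
for item in response["Items"]:
    print(item["OrderID"], item.get("Description"), item.get("UnitPrice"))
```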

  1. ElastiCache Overview

So, to be honest, ElastiCache does not come up very often in the exam, but I’m still going to introduce it to you just so you know what it is. In the same way you use RDS to get a managed relational database, you use ElastiCache to get a managed Redis or Memcached. They’re basically caching technologies: in-memory databases with really high performance and really low latency, used to cache objects. That helps you reduce the load on your database when you have read-intensive workloads, and it helps you make your application stateless. You also get write scaling using sharding, read scaling using read replicas, and Multi-AZ for failover capability. And AWS will take care of all the OS maintenance, patching, optimization, setup, configuration, monitoring, failure recovery and backups. Basically, this is RDS, but for caches. So we have two cache technologies; you don’t need to remember them in too much detail, but a high-level view is fine.
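To make that "relieve pressure on your database" idea concrete before we look at the two engines, here is a minimal cache-aside sketch using the redis-py client. The endpoint, key naming, TTL, and the load_user_from_database function are all assumptions, not anything specific to this course.

```python
import json

import redis

# Hypothetical ElastiCache Redis endpoint
cache = redis.Redis(host="my-cluster.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)


def load_user_from_database(user_id):
    # Placeholder for a real (and comparatively slow) RDS/database lookup.
    return {"id": user_id, "name": "example"}


def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)              # cache hit: skip the database entirely
    user = load_user_from_database(user_id)    # cache miss: go to the database...
    cache.setex(key, 300, json.dumps(user))    # ...and cache the result for 5 minutes
    return user
```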

There’s Redis and there’s Memcached. Redis is an in-memory key-value store with super low latency (sub-millisecond), and the cache will survive reboots by default; that’s called persistence. So your cache is persistent.

It’s great when you have user sessions, leaderboards, or distributed state, when you want to relieve pressure on your databases such as RDS, or when you want pub/sub for messaging; Redis is going to be a great fit for those use cases. You can also enable Multi-AZ with automated failover, as I already told you, for recovery in case you don’t want to lose your cache data. There’s also support for read replicas, so this really scales well. And Memcached is the other technology. It’s more used for simple object caching, and the cache doesn’t survive reboots this time.

So the use case would be quick retrieval of objects from memory when the cache is accessed often. Overall, I would say that Redis now has a feature set that overlaps Memcached’s, and it’s more popular, so if you need any caching, I would strongly recommend Redis instead of Memcached. That’s all you need to know going into the exam: some very high-level architectural aspects of ElastiCache. That’s why I’m pretty brief on it and don’t do any hands-on, but you get the idea. It’s an in-memory cache, high performance, good for caching database reads or writes, that kind of stuff. All right, I will see you in the next lecture.