Amazon AWS Certified Data Analytics - Specialty Practice Test Questions and Answers, Amazon AWS Certified Data Analytics - Specialty Exam Dumps - PrepAway
All Amazon AWS Certified Data Analytics - Specialty certification exam dumps, study guides, and training courses are prepared by industry experts. PrepAway's ETE files provide the AWS Certified Data Analytics - Specialty (DAS-C01) practice test questions and answers, exam dumps, study guide, and training courses to help you study and pass hassle-free!
Domain 2: Storage
10. Glacier & Vault Lock Policies
So now let's talk about Glacier. If you remember, Glacier was the last tier in that storage tier comparison table. And now we're just going to do a deep dive into the features of Glacier because it's way more than just a storage tier. So what is it? Well, it's definitely low-cost object storage, and it's meant for archiving and backup, and the data will be retained for the longer term. We're talking about tens of years, right? It's an alternative to running magnetic tape storage within your infrastructure. It has high annual durability, eleven nines, and a very low storage cost. It's around $0.004 per GB per month, plus the cost of retrieval.
Each item in Glacier is called an archive and can be up to 40 terabytes, and archives are stored in a vault. So the naming conventions differ from those in S3. So, an exam tip would be that any time we want to archive from S3 after x number of days, we would use a lifecycle policy and maybe use Glacier for that. Now, for Glacier operations, remember that restore links have an expiry date, and we have three retrieval options: Expedited, Standard, and Bulk. Each of them basically depends on your requirements.
A one-to-five-minute retrieval is going to be Expedited, but way more expensive. And all the way at the bottom, Bulk will take between five and 12 hours but will be way less expensive. Okay, the one thing I want you to remember from this lecture, though, is this feature called Vault Policies and Vault Lock. So, a vault, as I said, is a collection of archives, and each vault will have a bunch of archives and also one vault access policy and one vault lock policy. Both of these policies are written in JSON.
So they're very similar to a bucket policy. The vault access policy, as I said, will allow us to restrict user and account permissions. But this lock policy is a new type of policy that we haven't seen in S3. And so, for Glacier, what is the vault lock policy? It's a policy that you lock for regulatory or compliance reasons. And it's so particular because this policy, once it's set, cannot be changed. It's immutable, and that's why it's called a lock policy. So once you set it, you can never change it.
That is a guarantee that AWS gives you. So, the reason you would use this is, number one, you want to forbid deleting an archive if it's less than one year old. That could be the kind of lock policy you want. And because the lock policy can never be removed, you know for sure, when you go see a regulator or when you want to show compliance, that because you're using these vault lock policies, you're compliant with this rule.
The second option is to implement a WORM policy, which means "write once, read many." The idea is that you can write once to Glacier and be certain that it cannot be overwritten later. So vault lock policies are definitely going to come up at the exam in case you use Glacier for regulatory and compliance requirements. So remember, they exist. I hope that's enough for you. I hope you liked it, and I will see you in the next lecture.
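To make the "forbid deleting an archive if it's less than one year old" rule concrete, here is a minimal sketch of what such a vault lock policy document could look like, built as a Python dictionary and printed as JSON. The account ID, region, and vault name are hypothetical placeholders, and the statement is only one plausible way to express the rule.

```python
import json

# Hypothetical identifiers, used only for illustration.
ACCOUNT_ID = "123456789012"
VAULT_NAME = "example-vault"

vault_lock_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "deny-delete-before-one-year",
            "Principal": "*",
            "Effect": "Deny",
            "Action": "glacier:DeleteArchive",
            "Resource": f"arn:aws:glacier:us-east-1:{ACCOUNT_ID}:vaults/{VAULT_NAME}",
            # Deny the delete while the archive is younger than 365 days.
            "Condition": {
                "NumericLessThan": {"glacier:ArchiveAgeInDays": "365"}
            },
        }
    ],
}

policy_json = json.dumps(vault_lock_policy, indent=2)
print(policy_json)
```

Because a lock policy cannot be changed once it is locked, a document like this is worth reviewing carefully before it is applied.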
11. S3 & Glacier Select
A feature that is extremely important for S3 and big data is S3 Select and Glacier Select. The idea is that you want to retrieve less data out of S3 by using SQL to perform server-side filtering. This way, you can filter by rows and columns using simple SQL statements. And because the filtering happens directly on S3, this will result in less network transfer, because less data will have to go over the network, and in lower client-side CPU cost on your own servers. So here's a diagram directly from the AWS website. You see that before, we get all the data from Amazon S3 and then perform the filtering on our clients. But after, using S3 Select, the filtering happens on Amazon S3, and less data is transferred to our clients. And the results announced by AWS are that it's up to 400% faster and up to 80% cheaper to use S3 Select.
So if you want to see it as a diagram, our client will say, "I want to get this CSV and I want to use S3 Select." Amazon S3 will take the CSV and perform server-side filtering, then send us back the filtered dataset. So why is it relevant to us? Because S3 Select can be used with Hadoop. Assume you have a use case in which you need to transfer data from S3 before analysing it with your Hadoop cluster. So you want to take the data from S3 and transfer it to Hadoop. Then you should use S3 Select to do some server-side filtering and only retrieve the filtered dataset that you require, that is to say, the columns and rows that you require. And so with S3 Select, you're able to select very simply the columns and rows you need, transfer less data over the network, and save a lot of money, especially if you are in a big data type of situation. So I hope that was helpful. I hope you understand the usefulness of this feature, and I will see you in the next lecture.
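As a rough illustration of the server-side filtering described above, the sketch below builds the arguments for an S3 Select query over a CSV object and reads the filtered rows from the response stream. The bucket, key, and column names are hypothetical, and `run_select` assumes it is handed a client exposing `select_object_content` in the shape boto3's S3 client provides.

```python
def build_select_request(bucket, key, columns, where=None):
    """Build keyword arguments for an S3 Select query on a CSV object with a header row."""
    sql = f"SELECT {', '.join('s.' + c for c in columns)} FROM S3Object s"
    if where:
        sql += f" WHERE {where}"
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # USE tells S3 Select to treat the first CSV line as column names.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def run_select(s3_client, request):
    """Run the query and return the filtered rows as one string."""
    response = s3_client.select_object_content(**request)
    chunks = []
    # The response payload is a stream of events; only Records events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"].decode("utf-8"))
    return "".join(chunks)


# Hypothetical bucket, key, and columns.
request = build_select_request(
    bucket="example-bucket",
    key="data/users.csv",
    columns=["first_name", "age"],
    where="CAST(s.age AS INT) > 30",
)
print(request["Expression"])
```

With boto3 installed and credentials configured, `run_select(boto3.client("s3"), request)` would return only the matching columns and rows, so the client never downloads the full CSV.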
12. DynamoDB Overview
DynamoDB is a huge part of the storage section for the big data exam. Yes, indeed. You can store hundreds of terabytes of data in DynamoDB and expect great performance. So we'll see in this section how DynamoDB relates to big data. But first, the basics. DynamoDB is a fully managed database. It's highly available, and all your data is replicated across three availability zones.
It's a NoSQL database. That means it's not a relational database. It will scale to massive workloads, and it's distributed, and that's what makes it scale. You can issue millions of requests per second. You can have trillions of rows and, as I said, hundreds of terabytes of storage. The really good thing about DynamoDB is that you get fast and consistent performance.
As a result, every retrieval will have low latency, typically in the single-digit millisecond range. It's integrated with IAM for security, authorization, and administration, which is the best practise on AWS to manage security. On top of it, we can enable event-driven programming with DynamoDB Streams, where we can respond in real time to events that happen to our table.
It's low-cost and has auto-scaling capabilities, which make it the perfect serverless database. So in terms of the basics, DynamoDB is a database, but you don't need to create one; one is already there for you, and it's made of tables. Each table will have a primary key, and the primary key, as we'll see, must be decided at creation time. Each table can have an infinite number of items; an item is like a row. But because an item can have nested values, we usually prefer to call it an item rather than a row.
Each item will have attributes, and they can be added over time. You don't need to define them at table creation time. This is a difference versus, say, an RDBMS or a SQL database, and these attributes can be null. The maximum size of an item is 400 KB. So this is a really important consideration when you want to decide whether to store data in DynamoDB or in S3. S3 can have up to five terabytes of data for an object, whereas DynamoDB is limited to 400 KB per item. So be very careful about it. But in DynamoDB, obviously, you can have millions of rows of 400 KB, so it will still give you a great deal of storage. The data types that are supported are string, number, binary, boolean, and null; also list and map; and finally, some sets such as string sets, number sets, and binary sets. Now, let's talk about the primary keys. We have two options for this.
The first one is to use a partition key only, and it's called a hash key. That means that the partition key must be unique for each item, and it must be diverse enough so that the data is distributed. For example, if you have a users table, a user ID is a fantastic idea for a partition key. So we'll have our user ID as the partition key, and because this is a users table, we are guaranteeing that the user ID will be unique for each user. And then the attributes for our users might be the first name and the age.
So these will be the two kinds of rows we will have in our DynamoDB table. As we can see here, the user IDs are different for John and Katy. Okay, this is one option, but what if you want more information in this primary key? We can use something called a partition key combined with a sort key. This time, it's the combination that must be unique, and the data will be grouped together by the partition key. And the sort key will be used to basically sort your data within that partition. It's also called a range key. So if you have a users-games table, we want to basically have a dataset that stores all the games played by a user.
We can have a user ID for the partition key. That means that all the games belonging to one user ID will be within the same partition, and then we can use the game ID for the sort key. What will this look like? Well, it will look like this: my partition key is user ID, my sort key is game ID, and the combination of these two things will make my primary key. Maybe, in terms of attributes, we want to include the result. That gives you a good idea of how the table is designed. If we look at individual rows, we can see that the same user ID has played two different games, and we have a win and a loss. And it's fine that the partition key has duplicate values, because the combination of the partition key and the sort key is my primary key, and that is what needs to be unique.
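The users-games design described here can be sketched as a table definition plus two sample items. The dictionary follows the parameter shape that boto3's DynamoDB `create_table` call accepts; the table name, attribute names, and item values are hypothetical.

```python
# Table definition in the shape boto3's DynamoDB create_table call accepts.
# Table and attribute names are hypothetical.
users_games_table = {
    "TableName": "UsersGames",
    "AttributeDefinitions": [
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "game_id", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "user_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "game_id", "KeyType": "RANGE"},  # sort key
    ],
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
}

# Two games played by the same (made-up) user.
items = [
    {"user_id": "u-12", "game_id": "g-1", "result": "win"},
    {"user_id": "u-12", "game_id": "g-2", "result": "lose"},
]


def primary_key(item):
    """The primary key is the (partition key, sort key) pair."""
    return (item["user_id"], item["game_id"])


distinct_partition_keys = {item["user_id"] for item in items}
distinct_primary_keys = {primary_key(item) for item in items}
print(len(distinct_partition_keys), len(distinct_primary_keys))
```

The partition key repeats across the two items, while the combined primary key stays unique, which is exactly the property the table relies on.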
Okay, I hope that makes sense. So you need to think about choosing a good partition key. Remember, it needs to be well distributed. So if we're building a movie database, what is the best partition key to maximise data distribution? Is it a movie ID? Is it the name of the producer, the lead actor, or the language of the film? Think about it for a second and come up with the answer. Well, here, movie ID has the highest cardinality, so it's a very good candidate.
If you were to choose, for example, movie language, it doesn't take many values and may be more skewed toward English, so it's not a great partition key. The same goes for the lead actor's name. Maybe you have one actor who has done many movies, or maybe you have a producer name as well. Maybe you have a producer like Steven Spielberg, who does so many movies.
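One way to reason about candidate partition keys is to count distinct values and see how much of the data the most common value holds. The sketch below does that on a tiny made-up movie list; every row in it is invented purely for illustration.

```python
from collections import Counter

# Made-up sample data, for illustration only.
movies = [
    {"movie_id": "m1", "producer": "Steven Spielberg", "language": "English"},
    {"movie_id": "m2", "producer": "Steven Spielberg", "language": "English"},
    {"movie_id": "m3", "producer": "Steven Spielberg", "language": "English"},
    {"movie_id": "m4", "producer": "Producer A", "language": "English"},
    {"movie_id": "m5", "producer": "Producer B", "language": "French"},
]


def key_stats(rows, attribute):
    """Return (number of distinct values, share of rows held by the most common value)."""
    counts = Counter(row[attribute] for row in rows)
    return len(counts), max(counts.values()) / len(rows)


for attribute in ("movie_id", "producer", "language"):
    print(attribute, key_stats(movies, attribute))
```

A good partition key candidate has many distinct values and a small largest share; here that is `movie_id`, while `language` is the most skewed.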
OK, so the movie ID here will have the highest cardinality, so it's a very good candidate. Now, how does DynamoDB relate to the big data world? Well, there are some very common use cases that include mobile apps, gaming, digital ad serving, live voting, audience interaction for live events, sensor networks, log ingestion, access control for web-based content, metadata storage for Amazon S3 objects, e-commerce shopping carts, and web session management.
So the use cases are quite diverse. But basically, that means that any time you have some data that is going to be very hot and needs to be ingested at scale within a database, DynamoDB is going to be great for that. In terms of anti-patterns, what don't you do with DynamoDB?
For example, if you have a prewritten application tied to a traditional relational database (RDBMS), use RDS instead, because you do not want to completely rewrite it. If you need to perform joins or complex transactions, DynamoDB may not be your best friend; again, an RDS database may be better for you.
If you need to store large binary objects or blob data, maybe it's better to store the data in S3 and store the metadata in DynamoDB. This is a very common pattern. In general, if you have large amounts of data with a low I/O rate, such as very few writes or reads, S3 will be a much better storage option for you. So think about it: DynamoDB is going to be more efficient when your data is hot and smaller. S3 will be better when your data is a little colder but much larger.
13. DynamoDB RCU & WCU
Okay, so as part of the exam, you will need to compute your provisioned throughput. Basically, when you create a table in DynamoDB, you must provision read and write capacity units. What are those? Well, read capacity units are RCUs, and they define the throughput for reading.
And write capacity units are WCUs, and they define the throughput for writes, so how many elements per second we can write. Now we'll see the computation formulas in this lecture very, very soon. You can also set up auto scaling in case you don't want to figure out RCU and WCU in advance and you want them to go up and down based on the write and read patterns of your DynamoDB table. In addition, if you go over for a minute or two, you can temporarily exceed the throughput by using burst credits.
But if you use up all your burst credits, then you will get a ProvisionedThroughputExceededException. If you do get one, you should use something called exponential backoff retry, to basically retry once in a while with exponentially increasing wait times, just to make sure that your request eventually gets through. So let's talk about WCU first. We need to be able to compute it, because the exam will ask you how to compute the WCU for a table with whatever throughput they give you. So one write capacity unit (WCU) represents one write per second for an item up to 1 KB in size. So this is the simplest one, because it's just one and one.
So, if the item is larger than 1 KB in size, more WCU are consumed. The best way to understand this formula is to go through examples. If you write ten objects per second and each of them is 2 KB, you need 2 times 10, which equals 20 WCU. Similarly, if you write six objects per second and they're 4.5 KB each, then whenever you get a decimal, you round up. So we'll need six times five, or 30 WCU.
Finally, if you write 120 objects per minute and each object is 2 KB, then you need to bring this back to objects per second, which is 120 divided by 60, so two objects per second, times the size of each object, which is 2 KB. So you need four WCU. The write capacity unit is the simplest one. So remember the formula; it's very important. Before we introduce the read capacity units, we need to take a step back and understand the difference between strongly consistent reads and eventually consistent reads.
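The three write examples can be checked with a few lines of Python. This is a minimal sketch of the rule as stated in the lecture (one WCU per write per second, with the item size rounded up to the next whole KB); the function name is an assumption.

```python
import math


def wcu(writes_per_second, item_size_kb):
    """Write capacity units: writes per second times the item size rounded up to whole KB."""
    return math.ceil(writes_per_second * math.ceil(item_size_kb))


print(wcu(10, 2))        # 10 writes/s of 2 KB
print(wcu(6, 4.5))       # 4.5 KB rounds up to 5 KB
print(wcu(120 / 60, 2))  # 120 writes per minute is 2 writes per second
```

The three calls correspond to the 20, 30, and 4 WCU answers worked out in the lecture.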
So, an eventually consistent read means that if you read just after a write, there's a chance we'll get an unexpected response due to replication, possibly an outdated response. Whereas if you do a strongly consistent read, then if you read the data just after the write, we will get the correct data we just wrote. DynamoDB is a distributed database, and so by default it uses eventually consistent reads. But GetItem, Query, and Scan provide a ConsistentRead parameter, and you can set it to true to get a strongly consistent read.
Okay, so how does that work? Well, for example, imagine that DynamoDB is distributed across three servers. If your application writes to one of these servers, it will do some replication, maybe to server two and then to server three. But when you do a read, it's not guaranteed that you're going to read from the server you just wrote to. So it's possible to do a read from, say, server three. And so this is why, if you don't request a strongly consistent read, you may get something outdated or the wrong result.
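The three-server picture can be mimicked with a toy model. This is only an illustration of the idea, not how DynamoDB is actually implemented: writes land on one replica, copies reach the others later, and a read that is not strongly consistent may be served by a replica that has not caught up yet. All names here are hypothetical.

```python
class ToyReplicatedTable:
    """Toy model: three replicas; writes land on replica 0 and are copied out later."""

    def __init__(self):
        self.replicas = [{}, {}, {}]

    def write(self, key, value):
        # Only the first replica sees the write immediately.
        self.replicas[0][key] = value

    def replicate(self):
        # Copy everything from replica 0 to the other replicas.
        for replica in self.replicas[1:]:
            replica.update(self.replicas[0])

    def read(self, key, consistent_read=False, replica_index=2):
        # A strongly consistent read always uses the up-to-date replica;
        # otherwise the read is served by whichever replica is chosen.
        source = self.replicas[0] if consistent_read else self.replicas[replica_index]
        return source.get(key)


table = ToyReplicatedTable()
table.write("user-1", "v1")
table.replicate()
table.write("user-1", "v2")                         # not replicated yet
stale = table.read("user-1")                        # served by replica 2
fresh = table.read("user-1", consistent_read=True)  # served by replica 0
table.replicate()
caught_up = table.read("user-1")
print(stale, fresh, caught_up)
```

In this model the read without `consistent_read` returns the older value until replication runs, which is the "outdated response" the lecture warns about.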
Now this is just something you should deal with in your application. It is sometimes fine to have eventually consistent reads, and other times it is not. And so you request some strongly consistent reads. So why am I introducing this concept to you? Well, because when you consider RCU, you need to understand if it's for one strongly consistent read or for two eventually consistent reads.
So let's read the definition together. One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size. And if the items are larger than 4 KB, more RCU are consumed. You're basically going to round up to the next multiple of 4 KB. Okay? So it's always better to go through examples. If the exam asks you how we achieve ten strongly consistent reads per second of 4 KB each, well, this one is super easy. We take ten times 4 KB and divide by 4 KB.
Because remember, one RCU covers up to 4 KB. So ten RCU are required for this example. Example two is that if we have 16 eventually consistent reads per second and each of them is 12 KB, then we need to do a bit more math, which is 16 divided by 2, because we need half as many RCU for eventually consistent reads,
times 12 divided by 4, because each RCU covers 4 KB, which gives us 8 times 3, or 24 RCU. Finally, if you have ten strongly consistent reads per second of 6 KB each, what you'll do is round the 6 KB up to 8 KB, the next multiple of 4 KB. So ten times eight divided by four equals 20 RCU.
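The read examples follow the same pattern and can be verified in Python. This is a minimal sketch of the rule as stated in the lecture (item size rounded up to the next multiple of 4 KB, and eventually consistent reads costing half); the function name and parameters are assumptions.

```python
import math


def rcu(reads_per_second, item_size_kb, strongly_consistent=True):
    """Read capacity units for a given read rate, item size in KB, and consistency mode."""
    units_per_read = math.ceil(item_size_kb / 4)  # each RCU covers up to 4 KB
    total = reads_per_second * units_per_read
    if not strongly_consistent:
        total = math.ceil(total / 2)  # eventually consistent reads cost half
    return total


print(rcu(10, 4))                              # 10 strongly consistent reads of 4 KB
print(rcu(16, 12, strongly_consistent=False))  # 16 eventually consistent reads of 12 KB
print(rcu(10, 6))                              # 6 KB rounds up to 8 KB
```

These calls reproduce the 10, 24, and 20 RCU answers from the three examples.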
So be very, very confident with this formula. It's super important for you to be able to compute RCU and WCU, especially in a big data setting. Now, DynamoDB has some throttling, so if you exceed the RCU or WCU and go over the burst credits, then you will get a ProvisionedThroughputExceededException. And the reason for this may be that you have a hot key or a hot partition. That means that one key is being read too many times.
So, for example, if you're an e-commerce store and you're selling iPhones, maybe your iPhone is a very popular item, and everyone wants to read the iPhone, or at least the web page for the iPhone. And so your application will start requesting the iPhone item from DynamoDB over and over again. And that's called a hot key or hot partition, and so you will get throttled. Or you may have some very, very large items, because remember, RCU and WCU depend on the size of the items, and if you request them too many times, you will obviously exceed your RCU and WCU, and you may get throttled. So the solutions for those are, well, exponential backoff whenever you encounter the exception, which is already built into the SDK, or distributing the partition key as much as possible to make sure that your load is spread evenly between your partitions.
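To show what exponential backoff looks like when written by hand, here is a minimal sketch with random jitter. The `ThrottledError` class and the `flaky_write` function are hypothetical stand-ins for a throttled DynamoDB call, and the delays are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throughput-exceeded error (hypothetical name)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Retry operation on ThrottledError, doubling the maximum wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Wait a random time below a ceiling that doubles every attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


attempts = {"count": 0}


def flaky_write():
    """Simulated write that is throttled twice and then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("simulated throttle")
    return "ok"


result = call_with_backoff(flaky_write, sleep=lambda seconds: None)
print(result, attempts["count"])
```

The `sleep` parameter is injectable so the demo runs instantly; in real use the default `time.sleep` would apply the growing delays.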
Finally, if you still have an RCU issue, say, with that hot key on the iPhone, we might be able to use something called the DynamoDB Accelerator (DAX) to solve that issue, and we'll see how DAX solves issues using caching. Okay, so that's it for scaling DynamoDB; it's really important. I hope that makes sense to you. Let's quickly see the options in the UI now. OK, so let's go to the DynamoDB service. And in the DynamoDB service, I'm going to create a table. I'll call it a demo table. I'm going to use a partition key of user ID, and I can even add a sort key, for example, game ID. What matters here are the table settings. I will untick the default settings and go to Read/Write Capacity Mode.
So we can operate in either provisioned capacity mode or on-demand. On-demand is a new mode, and basically that means you don't even need to provision RCU or WCU; it will just provide them on demand. Now it's way more expensive, and the exam will not ask you to compute anything for on-demand because with on-demand there's basically nothing to compute. So what we need to do is choose provisioned, and here we can go into the calculations. So the provisioned capacity, as you can see, is for RCU and WCU.
And here they're locked at 5 because we have auto scaling enabled. So auto scaling means that your RCU here will scale between 5 and 400 and will basically target a utilisation of 70%. And in this case, the auto scaling for write capacity is the same, between 5 and 400, with a target utilisation of 70%. But if we don't want auto scaling, we can untick those.
And then, in this case, we can set up the read capacity units and write capacity units manually. So, for example, we could say 2300 or something similar. In this case, when we set up 2300, it gives us an estimated cost of $348 per month, which is a lot. Now, the capacity calculator is a way for you to basically enter the average item size, for example, 15 KB,
how many eventually consistent or strongly consistent reads per second you want, say, 60 strongly consistent reads, and how many writes per second you want, for example, 30. It then recommends a read capacity of 240 and a write capacity of 450, as well as the estimated cost of your table. So it's a very, very handy tool. Obviously, because I don't want to pay that much money, I'm just going to have five and five just by default, and that will be under the switch here for me, and I'll be done.
Now I can click on "Create," and here we go. My table can now take, for example, five writes per second. So it's simple to understand, but I wanted to show you the capacity calculator and everything else so you can understand and practise on your own if you want to run through scenarios. But now I hope that makes sense about how you can scale a DynamoDB table, and I will see you in the next lecture.
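The calculator figures quoted in the demo (240 read and 450 write capacity units) can be reproduced from the RCU and WCU rules, assuming the item size entered was 15 KB and the 60 reads per second were strongly consistent. Both of those are assumptions inferred from the numbers, and the function below is a sketch rather than the console's actual logic.

```python
import math


def estimate_capacity(item_size_kb, reads_per_second, writes_per_second,
                      strongly_consistent=True):
    """Return (read capacity units, write capacity units) for a steady workload."""
    read_units = reads_per_second * math.ceil(item_size_kb / 4)  # 4 KB per RCU
    if not strongly_consistent:
        read_units = math.ceil(read_units / 2)
    write_units = writes_per_second * math.ceil(item_size_kb)    # 1 KB per WCU
    return read_units, write_units


print(estimate_capacity(15, 60, 30))
```

With eventually consistent reads the same workload would need half the read capacity, which `estimate_capacity(15, 60, 30, strongly_consistent=False)` shows.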
14. DynamoDB Partitions
So something you need to absolutely understand when dealing with DynamoDB, especially in the context of big data, is how the internals of DynamoDB, especially partitions, work. So when you start with DynamoDB, for the table I just created, you are going to start with one partition.
And each partition has a boundary. It can only have up to 3000 RCU and 1000 WCU. And also, it can only contain up to 10 GB of data. So let's have a look. In here, I have a DynamoDB table, and it has three partitions. Okay? So we have our item with user ID one. And it will run a hashing algorithm on the partition key. We're not exactly aware of what the hashing algorithm is, but what we know is that the same key will go to the same partition.
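The rule that the same key always lands on the same partition can be illustrated with any deterministic hash. The sketch below uses MD5 purely as a stand-in, since the hashing algorithm DynamoDB really uses is not known to us, and the key values are hypothetical.

```python
import hashlib


def partition_for(partition_key, partition_count):
    """Map a partition key to a partition index with a deterministic hash (MD5 as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % partition_count


keys = ["user-1", "user-2", "user-3", "user-1"]
placements = [partition_for(key, 3) for key in keys]
print(placements)
```

Whatever indices come out, the first and last entries match because both are `user-1`, which is why one very popular key concentrates its traffic on a single partition.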
And so when user ID 1 is hashed, it's sent to, for example, partition 1. And if user ID 1 is encountered again, it will go back to partition 1. Maybe user ID 2 will go to partition 2, maybe user ID 3 will go to partition 2, and maybe user ID 4 will go to partition 2 again, user ID 5 to partition 1, and user ID 6 to partition 3. So how do we know how many partitions we're going to get? Well, basically, based on the limits of each partition, we can estimate the number of partitions we're going to get. So by capacity, it's total RCU divided by 3000 plus total WCU divided by 1000. And then by size, you look at the total size of your table and divide by 10 GB.
And to get the total number of partitions, simply take the larger of the two numbers we just calculated, by capacity and by size, and round it up. So, for example, if we have 50 GB of data, we should have at least five partitions, just to give you an idea. Something you should know, though, is that the WCU and the RCU are spread evenly between the partitions. So let's have an example here. Say we have a table with three partitions, 6000 RCU, and 2400 WCU. When the WCU and RCU are spread evenly, that means that each of these partitions is going to get 2000 RCU, because 2000 plus 2000 plus 2000 equals 6000.
And then each of these partitions is going to have 800 WCU because 800 times three equals 2400. So when you understand this about DynamoDB, you really understand how it works. And you can really see the concept of a hot partition. For example, if I have ten partitions and my user ID 1 is very, very hot, then its partition only gets one tenth of the RCU and WCU. And we can see how we can get throttled only on that hot partition. So it's really important to understand this, because once you understand how DynamoDB distributes data and based on which constraints the number of partitions grows, you're able to understand the behaviour of your table. And this is what the exam expects you to understand. So I hope that makes sense, and I will see you in the next lecture.
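The partition limits quoted in this lecture (3000 RCU, 1000 WCU, and 10 GB per partition) can be turned into a small estimator. This sketch assumes the by-capacity and by-size estimates are combined by taking the larger one and rounding up, which is a rule of thumb rather than a documented guarantee; the function names are hypothetical.

```python
import math

MAX_RCU_PER_PARTITION = 3000
MAX_WCU_PER_PARTITION = 1000
MAX_GB_PER_PARTITION = 10


def estimate_partitions(rcu, wcu, size_gb):
    """Estimate the partition count from provisioned throughput and table size."""
    by_capacity = rcu / MAX_RCU_PER_PARTITION + wcu / MAX_WCU_PER_PARTITION
    by_size = size_gb / MAX_GB_PER_PARTITION
    # Assumption: the larger estimate wins, rounded up, with at least one partition.
    return max(1, math.ceil(max(by_capacity, by_size)))


def per_partition_throughput(rcu, wcu, partitions):
    """RCU and WCU are spread evenly across the partitions."""
    return rcu / partitions, wcu / partitions


print(estimate_partitions(0, 0, 50))             # size-driven: 50 GB of data
print(per_partition_throughput(6000, 2400, 3))   # the three-partition example
```

The second call shows the even split from the example: each of the three partitions receives 2000 RCU and 800 WCU, so a single hot key can only ever use that share.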