Exploratory Data Analysis
8. Overview of Amazon QuickSight
Next, let's talk about Amazon QuickSight. Again, for the purpose of the exam, we don't need a ton of detail. You just need to know what it is, what it's for, and how it fits in with other systems. So what is QuickSight? Well, it's a cloud-powered business analytics service. They describe it as fast and easy; of course, that's subjective, but it's really meant for all employees in an organisation to build visualisations off of data. So it's a little bit different from other AWS services in that it's geared toward a more general audience, not so much toward developers.
It's made to let any analyst within your company perform ad hoc analysis on your data sets. As a result, you can gain business insights from your data at any time and on any device. It works on browsers and mobile devices, and it's also completely serverless, because we don't expect the analysts in your company to be managing their own servers. Obviously, QuickSight can connect to a wide variety of data sources. It might just be a CSV file sitting in S3, or it might be a Redshift data warehouse.
It could be a conventional relational database hosted in Aurora or RDS. It could be coming from Athena, like we talked about. Maybe it's coming from a database that you're rehosting on your own EC2 instance somewhere. And maybe it's just a file that you're feeding into it directly. It could even be an Excel file, in addition to CSV or TSV files or other common log formats that might be out there as well. Those could come from S3 or from on-site. And it does allow for some limited ETL as well, as part of its data preparation capabilities. So you can do some simple things like changing field names and data types, adding calculated columns to your data, or even issuing some simple SQL queries to add additional columns to your data set or manipulate that data. SPICE is a term that comes up frequently with QuickSight.
This is the name of their super-fast parallel in-memory calculation engine. It's their own proprietary thing that uses columnar storage, in-memory processing, and machine code generation to make QuickSight really, really fast. It's what makes it able to accelerate interactive queries on massive data sets. Every user in QuickSight gets 10 GB of SPICE usage. It's highly available, durable, and scalable, but you just have to remember that SPICE is the mechanism by which QuickSight is as quick as it is. Some QuickSight use cases: interactive ad hoc data exploration and visualisation. Obviously, that's what it's for.
You can also build dashboards and KPI dashboards. It also has this thing called "stories," which are like guided tours through specific views of an analysis. So you can sort of convey your key points and thought processes, or the evolution of an analysis, in narrative form using QuickSight stories. It can analyse and visualise data from a variety of sources, like we talked about. In addition to S3 and various databases, it can also connect to services such as Salesforce and, well, basically any data source that exposes a JDBC or ODBC endpoint interface, if you will. In the context of machine learning, we should also talk about QuickSight's Machine Learning Insights feature. It has three major features to it. One is ML-powered anomaly detection. Amazon QuickSight uses the Random Cut Forest algorithm that Amazon developed to analyse millions, if not billions, of data points and rapidly detect what the outliers are: the things that lie outside the expected bounds, right?
It also has ML-powered forecasting. Amazon QuickSight enables nontechnical users to confidently forecast their key business metrics, as they put it. And again, it's using Random Cut Forest to automatically handle real-world scenarios such as detecting seasonality and trends. So it's part of detecting those outliers. It also needs to understand what the seasonality and trends are in your time series data, and its forecasting capabilities are what extract that.
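The actual algorithm behind these features is Amazon's Random Cut Forest, which we can't reproduce in a few lines. As a much simpler, hypothetical illustration of the basic outlier-detection idea (flagging points that lie far outside the expected bounds), here's a crude z-score sketch in plain Python; it is not what QuickSight does internally.

```python
from statistics import mean, stdev

def find_outliers(points, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean.
    A crude z-score heuristic, NOT Random Cut Forest."""
    mu = mean(points)
    sigma = stdev(points)
    return [x for x in points if abs(x - mu) > threshold * sigma]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
print(find_outliers(readings))  # [95]
```

The one wildly different reading gets flagged; a real anomaly detector like Random Cut Forest handles seasonality and trends that this naive approach would misclassify.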
It also has something cool called autonarratives. Basically, it's a way to automatically build a dashboard that tells the story of your data in plain language. An example of that is shown on this slide, where it says "total blah for blah, from blah to blah, increased by blah and decreased for blah." You know, it's not as fancy as it sounds; it's just a way of translating the trends and seasonality in your data into words that you can actually put into a report somewhere.
So it might make life a little bit easier. It's kind of a cool application of machine learning. Anyway, those are the machine learning capabilities of QuickSight: anomaly detection, forecasting, and automatic generation of narratives. That's it. Okay, so that's all I can say about QuickSight in the context of machine learning, at least right now. As for some anti-patterns for QuickSight: if you want highly formatted canned reports, that's not really what it's for. It's more for ad hoc queries, analysis, and visualisation, sort of exploring the data interactively. For ETL, you don't want to be using QuickSight either; that's what Glue is for, although QuickSight can do some limited transformations. From a security standpoint, it does offer multi-factor authentication on every account. It works nicely with VPCs, and it has row-level security and also private VPC access if you want that as well. You would use an Elastic Network Interface or AWS Direct Connect to enable those capabilities. In the space of user management, users can be defined via IAM, the Identity and Access Management service in AWS, or through an email sign-up process. It can also integrate with Active Directory using the QuickSight Enterprise edition. We talked about the Enterprise edition having that Active Directory integration; it costs a lot more, at $18 per user per month as opposed to $9 per user per month for the Standard edition. If you require more SPICE capacity than the default 10 GB, you can purchase it, and you can just pay month-to-month for that extra SPICE capacity. What the Enterprise edition offers is both Active Directory integration and encryption at rest, and it's double the price for those two things.
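To make the pricing comparison concrete, here's a tiny sketch using the per-seat figures quoted above ($9 Standard, $18 Enterprise per user per month); extra SPICE capacity charges are deliberately omitted since they aren't covered here.

```python
# Seat-cost helper based on the per-user prices quoted in the lecture.
RATES = {"standard": 9, "enterprise": 18}  # USD per user per month

def monthly_seat_cost(users, edition="standard"):
    """Monthly QuickSight seat cost; ignores extra SPICE capacity charges."""
    return users * RATES[edition]

print(monthly_seat_cost(20, "standard"))    # 180
print(monthly_seat_cost(20, "enterprise"))  # 360, double the Standard price
```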
9. Types of Visualizations, and When to Use Them
To make things a little bit more real, here's what a dashboard in QuickSight would actually look like. A dashboard is a read-only snapshot of a previously created analysis. Once you've created a dashboard, you can share it with other users who have access to QuickSight, but they cannot edit it or change those filters. So here's an example that they have from Universal Scientific of a sales dashboard. As you can see, it's just a collection of various charts and graphs of relevant performance indicators that they care about, all in one spot.
There are multiple visual types available to you in QuickSight. The handiest thing is AutoGraph, which just automatically selects the most appropriate visualisation based on the properties of the data itself, instead of making you select one yourself. These visualisation types have been chosen to best reveal the data and relationships for you. So how would you use different types of visualisation for different types of problems?
Well, if you're trying to do a comparison or show a distribution like a histogram, maybe a bar chart would be a good choice. And there are many different flavours of bar charts at your disposal as well. Line graphs are appropriate for looking at changes over time, trends, and seasonality, like we talked about, right? If you're looking for a correlation between two different things, two different features of your data, that's where a scatter plot or a heat map might come into play. Pie charts and tree maps are useful for aggregating data and visualising how things aggregate together. And if you have tabular data that you just want to show in different ways, a pivot table might be a good choice.
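Those rules of thumb can be captured in a small lookup table. This is just a hypothetical mnemonic helper for exam review, not any QuickSight API:

```python
# The lecture's rules of thumb: analysis goal -> recommended visual type.
CHART_FOR_GOAL = {
    "comparison": "bar chart",
    "distribution": "bar chart (histogram)",
    "changes over time": "line graph",
    "correlation": "scatter plot or heat map",
    "aggregation": "pie chart or tree map",
    "tabular": "pivot table",
}

def suggest_chart(goal):
    """Look up the rule of thumb; fall back to QuickSight's AutoGraph idea."""
    return CHART_FOR_GOAL.get(goal.lower(), "no rule of thumb; try AutoGraph")

print(suggest_chart("correlation"))  # scatter plot or heat map
```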
Finally, we have stories, which are narratives that present iterations of your data. They are used to convey key points, a thought process, or the evolution of an analysis for collaboration. You construct them in QuickSight by capturing and annotating specific states of an analysis as you go. When readers of the story click on an image in the story, they are taken into the analysis at that point, where they can further explore it on their own. Let's dive into some specific examples here. Here's what a bar chart looks like. Again, these are intended for comparisons or distributions. In this case, we're looking at the human losses of World War II by country, and you can see that certain countries took a much bigger hit than others, in terms of a percentage of their population and in terms of raw numbers as well. The Soviet Union is leading the pack, and the United States is pretty far down the list there.
Here's a histogram of aeroplane arrivals per minute; again, a bar chart is a good choice for a histogram. Line charts, as we said, are for changes over time, looking for trends, seasonality, and things like that. So, speed versus time: here, you can see that it makes sense for those dots to be connected, because there is a trend that we're trying to extract from those data points. And if you have different components of your data that contribute to an overall total, you can represent how those different components add up using a stacked line chart as well. Scatter plots are useful for correlation. So, for example, if you wanted to plot the eruption duration versus the waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park here in America, you might have data points that look like this, and you might be able to try to fit a line there to see if there's some sort of correlation.
But if you just want to eyeball the raw data and see if there is a correlation of some sort, that's a useful use of a scatter plot. Heat maps are also useful for correlation. A heat map just shows you, based on the colour of each combination of attributes, how often that specific combination occurs. So depending on what colour scale you're using, maybe red means hot here, and we can see that in the upper right-hand corner there's something interesting going on that we might want to dig into in this heat map. Pie charts are useful for aggregation. In this particular pie chart of something, we can see that the USA is contributing the most things to that overall total. The United Kingdom is second, followed by Canada, Australia, and other countries.
I think you all know what a pie chart is. One thing you might not have seen before is a tree map. It's also something that QuickSight can do. These are used for hierarchical aggregation. It's kind of a neat thing there. You can see here that we're looking at agricultural production, and we can see how raw cotton appears there, and that's further broken down into other subcategories. You can kind of think about it as pie charts within pie charts, if you will. So, you know, this brown area here might represent a certain classification of things, and then we have further subcategories within it that are broken down. So that's what a tree map is. Pivot tables: if you've used Excel, you know what those are all about. It's just a way of organising your data in different ways and aggregating it in arbitrary ways. So, for example, we have raw sales data in the top chart here, and if we pivot on region and ship date, we can view sales by region over time. So that's a good way to interactively explore your tabular data. And again, this is something that QuickSight supports.
10. Elastic MapReduce (EMR) and Hadoop Overview
Next, let's dive into the world of Elastic MapReduce, or EMR for short. It is a managed Hadoop framework that runs on EC2 instances, and its name is kind of confusing, because MapReduce itself is kind of an obsolete part of Hadoop, and EMR is a lot more than just a Hadoop cluster. So yes, it does include Hadoop and MapReduce, but it's the stuff that's built on top of Hadoop that you tend to use more often: technologies like Spark, HBase, Presto, Flink, or Hive.
And even more things come preinstalled with EMR, potentially. There's also something called an EMR notebook. That's much like a Jupyter notebook that runs on your EMR cluster, and it has several integration points with AWS services as well.
This is relevant for the world of machine learning because if you have a massive data set that you need to prepare, normalise, scale, or otherwise treat before it goes into your algorithms, EMR provides a way of distributing the load of processing that data across an entire cluster of computers. So for massive data sets, often you need a cluster to actually process that data and prepare it for your machine learning training jobs in parallel, across an entire cluster. A cluster in EMR is a collection of EC2 instances, where every EC2 instance is called a node.
Each node has a role within the cluster, which is called the node type. First is the master node, which manages the cluster by running software components to coordinate the distribution of data and tasks among the other nodes for processing. It tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node. You can also create a single-node cluster with just a master node, if you have little enough processing that it can run on a single machine. The master node is also sometimes referred to as the leader node. We also have core nodes.
These are nodes with software components that run tasks and store data on the Hadoop Distributed File System, or HDFS for short, on your cluster. Multi-node clusters will have at least one core node. And then we have task nodes. These are nodes with software components that only run tasks and do not store data on HDFS. These are optional; you don't need them, but they are a good use of spot instances, because you can introduce them and take them out of the cluster as needed without impacting the storage on your cluster. Since they don't actually talk to HDFS, there aren't any permanent files stored on a task node that your cluster needs, so they're only used for computation. So if you have a sudden task that needs to run on a massive amount of data and you're only going to do that once, you can introduce a task node into your cluster and then remove it when you're done, and your cluster will just continue to happily run without it. There are a couple of ways to use an EMR cluster: as a transient or a long-running cluster.
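The three node types above can be sketched as the kind of request you might hand to boto3's EMR `run_job_flow` call. The cluster name, release label, instance types, and counts here are all illustrative assumptions, and the dict is only built locally, never sent to AWS:

```python
# Sketch of an EMR cluster definition with the three node types from the
# lecture. All names, types, and counts are hypothetical; in practice you'd
# pass a dict like this to boto3's EMR client run_job_flow().
def build_cluster_request(core_count=2, task_count=2):
    return {
        "Name": "example-ml-prep-cluster",   # hypothetical name
        "ReleaseLabel": "emr-6.x",           # placeholder release label
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                # Exactly one master node coordinates the cluster.
                {"InstanceRole": "MASTER", "InstanceCount": 1,
                 "InstanceType": "m5.xlarge"},
                # Core nodes run tasks AND store HDFS data.
                {"InstanceRole": "CORE", "InstanceCount": core_count,
                 "InstanceType": "m5.xlarge"},
                # Task nodes run tasks only; no HDFS, so spot is safe.
                {"InstanceRole": "TASK", "InstanceCount": task_count,
                 "InstanceType": "m5.xlarge", "Market": "SPOT"},
            ]
        },
    }

req = build_cluster_request()
roles = [g["InstanceRole"] for g in req["Instances"]["InstanceGroups"]]
print(roles)  # ['MASTER', 'CORE', 'TASK']
```

Note that only the task group is marked as a spot candidate, matching the point above that task nodes can come and go without endangering HDFS storage.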
A transient cluster is configured to be automatically terminated once all the steps that you've defined to run on it have been completed. So the steps you might run would include loading input data, processing that data, storing the results of the output, and then shutting down the entire cluster. So if you have a predefined sequence of things that you want your cluster to do, you can save money by automatically terminating that cluster as soon as that task has been completed. Long-running clusters are also possible, in which you create a cluster, interact directly with the applications on it, and then manually terminate it when you're finished.
So this is more appropriate for, say, ad hoc queries or experimenting with data sets, where you don't really know what you want to do upfront and you don't have some repeatable sequence that you just want to run over and over again. In that case, you would run a long-running cluster and then manually terminate it when you're finished. Now, when you launch a cluster, you select the frameworks and applications to install for your data processing needs at that time. Once you have a cluster spun up, either way, you can connect to it directly through the master node, through EC2, and run jobs from the terminal there, or you can submit ordered steps via the console, if you can predefine those steps through the console in AWS. Either way, it works. We talked about integration points between EMR and AWS.
Obviously, those are important. For example, it uses EC2, of course, for the actual underlying instances that comprise the nodes of your cluster. It can work with Amazon VPC to host your cluster within a virtual network. You can store your input data and your output data on Amazon S3 instead of HDFS if you wish. You can also use Amazon CloudWatch to monitor the performance of your cluster and configure alarms on the individual nodes in your cluster. IAM can be used to configure permissions for your cluster.
CloudTrail will create an audit trail for any requests made to the services in your cluster. And finally, AWS Data Pipeline can be used to schedule and start clusters that run a predefined series of steps. Let's talk a little bit more about the storage on EMR. Now, the default storage solution on Hadoop is HDFS, and EMR is a Hadoop system, so we do have HDFS available to you. It's a distributed, scalable file system for Hadoop. It distributes the data that it stores across every instance in your cluster, and it stores multiple copies of the data on different instances to ensure that no data is lost if an individual instance fails.
Every file in HDFS is stored as blocks and is distributed across the Hadoop cluster. By default, the size of a block in HDFS is 128 megabytes. Now, this is ephemeral storage: once you terminate your cluster, the storage that was held locally on those nodes goes with it. So that's a reason not to use HDFS. However, it's going to be a lot faster, right? We don't have to go across the Internet to access that data. It's all done locally on the nodes that are processing your data. And Hadoop has a lot of smarts built into it, so that it tries to optimise things such that the node where the code is running to process a bit of data is the same node where that data is stored. So HDFS is very good from a performance standpoint, but it has the downside that when you shut down your cluster, that data goes away.
That's not a good thing. However, we have EMRFS, which allows you to use S3 as though it were an HDFS file system, and it turns out it's still pretty darn fast. There's also an optional consistent view in EMRFS for S3 consistency, and you can actually use DynamoDB to track that consistency across EMRFS. So a key thing with EMR is that you can use S3 in place of HDFS. That's the key point there. You can also, of course, use the local file system if you want to, for ephemeral things.
But again, that's not going to be distributed, so that is only useful, for example, on the master node, where you're trying to basically stage data and get it where it needs to be. We can also back HDFS with Elastic Block Store, so EBS and HDFS have a relationship too. EMR charges by the hour, plus any EC2 charges under the hood. It does promise that it will provision new nodes for you automatically if a core node fails, so you don't have to worry about that, at least. And you can add and remove task nodes on the fly as well, like we talked about. You can use spot instances to add additional capacity or remove capacity without impacting the underlying storage on the HDFS file system. You can also resize the core nodes on a running cluster, which you can use for capacity planning too.
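As a quick worked example of the default 128 MB HDFS block size mentioned earlier, here's the arithmetic for how many blocks a file occupies:

```python
import math

BLOCK_MB = 128  # default HDFS block size

def block_count(file_mb):
    """Number of HDFS blocks a file of `file_mb` megabytes occupies.
    The last block may be partially filled, hence the ceiling."""
    return math.ceil(file_mb / BLOCK_MB)

print(block_count(1000))  # a ~1 GB file -> 8 blocks
```

Each of those blocks is then replicated to multiple instances, which is how HDFS survives the failure of an individual node.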
11. Apache Spark on EMR
So, what exactly is Hadoop? It's made up of a few different modules, which we'll go over, and the rest of the stuff on your EMR cluster frequently builds on top of these modules as well. So when we talk about Hadoop itself, we're usually talking about the HDFS file system, YARN, and MapReduce. And underlying all of these is something called Hadoop Core, or Hadoop Common.
That would be the libraries and utilities required for all of these modules to run on top of. It provides the file system and operating system abstractions that Hadoop needs, and all the Java archive files and scripts required to start Hadoop itself, at the lowest level.
Above that would be HDFS, the Hadoop Distributed File System: a distributed, scalable file system for Hadoop. Again, it distributes the data it stores across the instances in the cluster, and multiple copies of the data are stored on different instances to ensure that no data is lost if an individual instance fails. The data is lost upon the termination of the cluster, however. Still, it's useful for caching intermediate results during MapReduce processing, or for workloads that have significant random I/O.
On top of HDFS, we've got Hadoop YARN. YARN stands for Yet Another Resource Negotiator. It's a component introduced in Hadoop 2.0 to centrally manage cluster resources for multiple data processing frameworks, which enables us to use things other than MapReduce, as we'll see shortly. So what is MapReduce? Well, MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. It dates back to some ideas from Google in the early days of big data processing. MapReduce consists of mapper functions and reducer functions that you write in code.
A map function maps data to sets of key-value pairs called the intermediate results. Map functions generally do things like transform your data, reformat it, or extract the data that you need; that's why it's relevant to the world of exploratory data analysis. Then we have reduce functions that combine those intermediate results, apply additional algorithms, and produce the final output. In general, mappers are the things that transform and prepare your data, and the reducers aggregate that data and distil it down to the final answer that you want. However, these days, Apache Spark has largely taken the place of MapReduce.
And thanks to the existence of YARN, Spark can actually sit on top of your Hadoop cluster and use the underlying resource negotiator and file system that Hadoop offers, while providing a faster alternative to MapReduce. So Spark, which can be optionally installed on your EMR cluster, is an open-source distributed processing system commonly used for big data workloads. It's really hot right now. It uses in-memory caching and optimised query execution instead of MapReduce for fast analytic queries against data of any size. It also uses something called a directed acyclic graph; that's kind of its main speed trick. Compared to MapReduce, it can be smarter about the dependencies in processing and how to schedule tasks more effectively.
Spark has APIs for Java, Scala, Python, and R, and it supports code reuse across multiple workloads like batch processing, interactive queries, real-time analytics, machine learning, and graph processing. It has a bunch of different use cases that we'll talk about: stream processing, machine learning, and interactive SQL. However, Spark is generally not used for OLTP or batch processing jobs; it's more for transforming data as it comes in. How does Spark work under the hood? Well, Spark applications run as independent sets of processes on a cluster, all coordinated by the SparkContext object of the main program, which is known as the driver program. That's the actual code that you write to make your Spark job run.
The SparkContext connects to different cluster managers, which allocate resources across applications. For example, it can use YARN, or Spark's own built-in cluster manager if you're not on a Hadoop cluster. Upon connecting, Spark will acquire executors on nodes in the cluster. The executors are processes that run computations and store data for your application. The application code is sent to the executors, and in the final step, the SparkContext sends tasks to the executors to run.
Spark itself has many different components, just like Hadoop does. Underlying everything is Spark Core. It acts as the foundation for the platform. It's responsible for things like memory management, fault recovery, scheduling, distributing and monitoring your jobs, and interacting with storage systems. It has APIs for Java, Scala, Python, and R.
And at the lowest level, it uses something called a resilient distributed dataset, or RDD, that represents a logical collection of data partitioned across different compute nodes. Now, as we'll see, there's a layer above that with Spark SQL, which is a distributed query engine that provides low-latency interactive queries up to 100 times faster than MapReduce. It includes a cost-based optimizer, columnar storage, and code generation for fast queries.
And it supports various data sources coming from JDBC, ODBC, JSON, HDFS, Hive, ORC, or Parquet files. It also supports querying Hive tables using HiveQL if you want. But the really important thing about Spark SQL is that it exposes something called a DataFrame in Python, or a Dataset in Scala. And this is sort of taking the place of the lower-level resilient distributed datasets in Spark these days.
So modern Spark code tends to interact with data in much the same way as you would with a data frame in Pandas, or a table in a relational database. You can actually issue SQL commands to your Spark cluster, and under the hood, it will figure out how to transform that into a distributed query that executes across your entire cluster. Very useful stuff. We also have Spark Streaming.
That's a real-time solution that leverages Spark Core's fast scheduling capabilities to do streaming analytics. So data gets ingested in mini-batches, and analytics code written for batch processing can be applied to those mini-batches within the same application. So it's pretty cool, because you can use the same code that you wrote for batch processing and apply it to your real-time stream with Spark Streaming. It supports ingestion from Twitter, Kafka, Flume, HDFS, and ZeroMQ, and as we'll see, it can also integrate with AWS Kinesis. We also have MLlib, the machine learning library for Spark. Obviously, that's relevant to the machine learning exam; we'll talk about what it can do in more depth, of course. And finally, we have GraphX, which is a distributed graph processing framework built on top of Spark.
We're not talking about charts and graphs like in QuickSight here; we're talking about computer science graphs, for example, a graph of people in a social network. It's more of a data structure thing. It provides ETL, exploratory analysis, and iterative graph computation to enable users to interactively build and transform a graph data structure at scale. So let's talk about MLlib a little bit more. It offers several different machine learning algorithms, but what's special is that they're all implemented in a way that is distributed and scalable. Not all machine learning algorithms really lend themselves to parallel processing; a lot of them sort of need to be reimagined in order to spread that load out across an entire cluster of computers.
So Spark MLlib has a specific set of things that it can do. For classification tasks, it offers logistic regression, which is still a great alternative, and it can also do regression and decision trees across a cluster. It has a recommender engine built in, based on alternating least squares. It has a k-means clustering implementation built in. For topic modelling, it has LDA, if you want to extract topics from text in documents in an unsupervised manner.
And it has various workflow utilities that are useful for machine learning in general, such as pipelines, feature transformation, and persistence. It also has distributed implementations of SVD, PCA, and statistics functions as well. We'll go over what those mean later in the modelling section if you're unfamiliar with them. But again, the special thing here is that this can all run on a cluster. A lot of these algorithms will not run on a cluster in their default state if you were to run them in, say, scikit-learn, for example.
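As a toy illustration of what one of those algorithms, k-means clustering, actually does, here's a deliberately non-distributed, single-machine sketch on 1-D data in plain Python. MLlib's value is that it runs this same idea across an entire cluster; this version is only to show the alternating assign/update loop.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy single-machine k-means on 1-D data; NOT MLlib's distributed
    implementation, just the core idea."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(v) / len(v) if v else c for c, v in clusters.items()]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_1d(data, [0.0, 5.0]))  # converges near [1.0, 9.5]
```

The two centroids settle onto the two obvious clumps in the data. MLlib parallelises the assignment step across partitions of a massive dataset, which is why it scales where a naive loop like this would not.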
So Spark MLlib allows you to process massive data sets and actually train machine learning models on them across an entire cluster. And as we'll see later in the course, you can even use Spark within SageMaker as well. Spark Streaming deserves a little more discussion too. Generally, Spark applications use a dataset in their code that refers to your data, which is treated a lot like a database table. With Spark Streaming, that table just keeps on growing as new chunks of data are received in real time, and you can query that data using windows of time.
So, for example, you can look at the past hour of data in your data stream coming in and just query that like a database, basically. So that's a high-level view of how structured streaming works: Spark models inbound streaming data as basically an unbounded database table that you can query whenever you want to.
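Here's a tiny single-machine sketch of that "unbounded table plus time window" idea in plain Python. Real structured streaming computes this incrementally and in a distributed fashion; this only shows the query model of asking for the last hour of an ever-growing table.

```python
from datetime import datetime, timedelta

# Toy model of structured streaming's "unbounded table": events keep
# appending, and queries look at a sliding window of time.
events = []  # list of (timestamp, value) rows

def append_event(ts, value):
    events.append((ts, value))

def window_sum(now, window=timedelta(hours=1)):
    """Sum of values for events within the past `window`, i.e. querying
    the last hour of the stream as if it were a table."""
    return sum(v for ts, v in events if now - ts <= window)

now = datetime(2024, 1, 1, 12, 0)
append_event(now - timedelta(minutes=90), 5)  # outside the 1-hour window
append_event(now - timedelta(minutes=30), 7)
append_event(now - timedelta(minutes=5), 3)
print(window_sum(now))  # 10
```

Only the two events inside the window contribute; the older event is still in the "table" but falls outside the queried time range.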
And Spark Streaming does integrate with Kinesis. You could have a Kinesis producer publish data to a Kinesis data stream, and there is a way, using the KCL, to actually implement a Spark dataset built on top of that data stream that you can then query like any other Spark dataset. Another key tool with Spark is Zeppelin, which is basically a notebook for Spark.
It allows you to run Spark code interactively within a notebook environment in a browser. You can execute SQL queries against your Spark data using Spark SQL, and you can also query your data and visualise it in charts and graphs using things like Matplotlib and Seaborn. So it makes Spark feel a lot more like a data science tool and allows you to preprocess your data in a format that data scientists are used to.