
Professional Data Engineer: Professional Data Engineer on Google Cloud Platform Certification Video Training Course
The complete solution to prepare for your exam with the Professional Data Engineer: Professional Data Engineer on Google Cloud Platform certification video training course. The course contains a complete set of videos that will provide you with thorough knowledge of the key concepts. Top-notch prep including Google Professional Data Engineer exam dumps, study guide & practice test questions and answers.
Professional Data Engineer: Professional Data Engineer on Google Cloud Platform Certification Video Training Course Exam Curriculum
You, This Course and Us
-
1. You, This Course and Us
Introduction
-
1. Theory, Practice and Tests
-
2. Lab: Setting Up A GCP Account
-
3. Lab: Using The Cloud Shell
Compute
-
1. Compute Options
-
2. Google Compute Engine (GCE)
-
3. Lab: Creating a VM Instance
-
4. More GCE
-
5. Lab: Editing a VM Instance
-
6. Lab: Creating a VM Instance Using The Command Line
-
7. Lab: Creating And Attaching A Persistent Disk
-
8. Google Container Engine - Kubernetes (GKE)
-
9. More GKE
-
10. Lab: Creating A Kubernetes Cluster And Deploying A Wordpress Container
-
11. App Engine
-
12. Contrasting App Engine, Compute Engine and Container Engine
-
13. Lab: Deploy And Run An App Engine App
Storage
-
1. Storage Options
-
2. Quick Take
-
3. Cloud Storage
-
4. Lab: Working With Cloud Storage Buckets
-
5. Lab: Bucket And Object Permissions
-
6. Lab: Lifecycle Management On Buckets
-
7. Lab: Running A Program On a VM Instance And Storing Results on Cloud Storage
-
8. Transfer Service
-
9. Lab: Migrating Data Using The Transfer Service
-
10. Lab: Cloud Storage ACLs and API access with Service Account
-
11. Lab: Cloud Storage Customer-Supplied Encryption Keys and Life-Cycle Management
-
12. Lab: Cloud Storage Versioning, Directory Sync
Cloud SQL, Cloud Spanner ~ OLTP ~ RDBMS
-
1. Cloud SQL
-
2. Lab: Creating A Cloud SQL Instance
-
3. Lab: Running Commands On Cloud SQL Instance
-
4. Lab: Bulk Loading Data Into Cloud SQL Tables
-
5. Cloud Spanner
-
6. More Cloud Spanner
-
7. Lab: Working With Cloud Spanner
BigTable ~ HBase = Columnar Store
-
1. BigTable Intro
-
2. Columnar Store
-
3. Denormalised
-
4. Column Families
-
5. BigTable Performance
-
6. Lab: BigTable demo
Datastore ~ Document Database
-
1. Datastore
-
2. Lab: Datastore demo
BigQuery ~ Hive ~ OLAP
-
1. BigQuery Intro
-
2. BigQuery Advanced
-
3. Lab: Loading CSV Data Into Big Query
-
4. Lab: Running Queries On Big Query
-
5. Lab: Loading JSON Data With Nested Tables
-
6. Lab: Public Datasets In Big Query
-
7. Lab: Using Big Query Via The Command Line
-
8. Lab: Aggregations And Conditionals In Aggregations
-
9. Lab: Subqueries And Joins
-
10. Lab: Regular Expressions In Legacy SQL
-
11. Lab: Using The With Statement For SubQueries
Dataflow ~ Apache Beam
-
1. Data Flow Intro
-
2. Apache Beam
-
3. Lab: Running A Python Data flow Program
-
4. Lab: Running A Java Data flow Program
-
5. Lab: Implementing Word Count In Dataflow Java
-
6. Lab: Executing The Word Count Dataflow
-
7. Lab: Executing MapReduce In Dataflow In Python
-
8. Lab: Executing MapReduce In Dataflow In Java
-
9. Lab: Dataflow With Big Query As Source And Side Inputs
-
10. Lab: Dataflow With Big Query As Source And Side Inputs 2
Dataproc ~ Managed Hadoop
-
1. Data Proc
-
2. Lab: Creating And Managing A Dataproc Cluster
-
3. Lab: Creating A Firewall Rule To Access Dataproc
-
4. Lab: Running A PySpark Job On Dataproc
-
5. Lab: Running The PySpark REPL Shell And Pig Scripts On Dataproc
-
6. Lab: Submitting A Spark Jar To Dataproc
-
7. Lab: Working With Dataproc Using The GCloud CLI
Pub/Sub for Streaming
-
1. Pub Sub
-
2. Lab: Working With Pubsub On The Command Line
-
3. Lab: Working With PubSub Using The Web Console
-
4. Lab: Setting Up A Pubsub Publisher Using The Python Library
-
5. Lab: Setting Up A Pubsub Subscriber Using The Python Library
-
6. Lab: Publishing Streaming Data Into Pubsub
-
7. Lab: Reading Streaming Data From PubSub And Writing To BigQuery
-
8. Lab: Executing A Pipeline To Read Streaming Data And Write To BigQuery
-
9. Lab: Pubsub Source BigQuery Sink
Datalab ~ Jupyter
-
1. Data Lab
-
2. Lab: Creating And Working On A Datalab Instance
-
3. Lab: Importing And Exporting Data Using Datalab
-
4. Lab: Using The Charting API In Datalab
TensorFlow and Machine Learning
-
1. Introducing Machine Learning
-
2. Representation Learning
-
3. NN Introduced
-
4. Introducing TF
-
5. Lab: Simple Math Operations
-
6. Computation Graph
-
7. Tensors
-
8. Lab: Tensors
-
9. Linear Regression Intro
-
10. Placeholders and Variables
-
11. Lab: Placeholders
-
12. Lab: Variables
-
13. Lab: Linear Regression with Made-up Data
-
14. Image Processing
-
15. Images As Tensors
-
16. Lab: Reading and Working with Images
-
17. Lab: Image Transformations
-
18. Introducing MNIST
-
19. K-Nearest Neighbors
-
20. One-hot Notation and L1 Distance
-
21. Steps in the K-Nearest-Neighbors Implementation
-
22. Lab: K-Nearest-Neighbors
-
23. Learning Algorithm
-
24. Individual Neuron
-
25. Learning Regression
-
26. Learning XOR
-
27. XOR Trained
Regression in TensorFlow
-
1. Lab: Access Data from Yahoo Finance
-
2. Non TensorFlow Regression
-
3. Lab: Linear Regression - Setting Up a Baseline
-
4. Gradient Descent
-
5. Lab: Linear Regression
-
6. Lab: Multiple Regression in TensorFlow
-
7. Logistic Regression Introduced
-
8. Linear Classification
-
9. Lab: Logistic Regression - Setting Up a Baseline
-
10. Logit
-
11. Softmax
-
12. Argmax
-
13. Lab: Logistic Regression
-
14. Estimators
-
15. Lab: Linear Regression using Estimators
-
16. Lab: Logistic Regression using Estimators
Vision, Translate, NLP and Speech: Trained ML APIs
-
1. Lab: Taxicab Prediction - Setting up the dataset
-
2. Lab: Taxicab Prediction - Training and Running the model
-
3. Lab: The Vision, Translate, NLP and Speech API
-
4. Lab: The Vision API for Label and Landmark Detection
Virtual Machines and Images
-
1. Live Migration
-
2. Machine Types and Billing
-
3. Sustained Use and Committed Use Discounts
-
4. Rightsizing Recommendations
-
5. RAM Disk
-
6. Images
-
7. Startup Scripts And Baked Images
VPCs and Interconnecting Networks
-
1. VPCs And Subnets
-
2. Global VPCs, Regional Subnets
-
3. IP Addresses
-
4. Lab: Working with Static IP Addresses
-
5. Routes
-
6. Firewall Rules
-
7. Lab: Working with Firewalls
-
8. Lab: Working with Auto Mode and Custom Mode Networks
-
9. Lab: Bastion Host
-
10. Cloud VPN
-
11. Lab: Working with Cloud VPN
-
12. Cloud Router
-
13. Lab: Using Cloud Routers for Dynamic Routing
-
14. Dedicated Interconnect Direct and Carrier Peering
-
15. Shared VPCs
-
16. Lab: Shared VPCs
-
17. VPC Network Peering
-
18. Lab: VPC Peering
-
19. Cloud DNS And Legacy Networks
Managed Instance Groups and Load Balancing
-
1. Managed and Unmanaged Instance Groups
-
2. Types of Load Balancing
-
3. Overview of HTTP(S) Load Balancing
-
4. Forwarding Rules Target Proxy and Url Maps
-
5. Backend Service and Backends
-
6. Load Distribution and Firewall Rules
-
7. Lab: HTTP(S) Load Balancing
-
8. Lab: Content Based Load Balancing
-
9. SSL Proxy and TCP Proxy Load Balancing
-
10. Lab: SSL Proxy Load Balancing
-
11. Network Load Balancing
-
12. Internal Load Balancing
-
13. Autoscalers
-
14. Lab: Autoscaling with Managed Instance Groups
Ops and Security
-
1. StackDriver
-
2. StackDriver Logging
-
3. Lab: Stackdriver Resource Monitoring
-
4. Lab: Stackdriver Error Reporting and Debugging
-
5. Cloud Deployment Manager
-
6. Lab: Using Deployment Manager
-
7. Lab: Deployment Manager and Stackdriver
-
8. Cloud Endpoints
-
9. Cloud IAM: User accounts, Service accounts, API Credentials
-
10. Cloud IAM: Roles, Identity-Aware Proxy, Best Practices
-
11. Lab: Cloud IAM
-
12. Data Protection
Appendix: Hadoop Ecosystem
-
1. Introducing the Hadoop Ecosystem
-
2. Hadoop
-
3. HDFS
-
4. MapReduce
-
5. Yarn
-
6. Hive
-
7. Hive vs. RDBMS
-
8. HQL vs. SQL
-
9. OLAP in Hive
-
10. Windowing Hive
-
11. Pig
-
12. More Pig
-
13. Spark
-
14. More Spark
-
15. Streams Intro
-
16. Microbatches
-
17. Window Types
About Professional Data Engineer: Professional Data Engineer on Google Cloud Platform Certification Video Training Course
Professional Data Engineer: Professional Data Engineer on Google Cloud Platform certification video training course by prepaway along with practice test questions and answers, study guide and exam dumps provides the ultimate training package to help you pass.
BigTable ~ HBase = Columnar Store
1. BigTable Intro
Why is it easier to add columns on the fly in BigTable than in Cloud Spanner? This is a question I'd like you to think about. Let's say you have a complex application and you now want to change your database design: you'd like to add a whole bunch of columns to existing tables. That is very difficult to do in Cloud Spanner, but BigTable makes it very simple. Why is that? We've been going on and on about Cloud Spanner.
So let's now turn our conversation to HBase and its GCP equivalent, BigTable. Recall that BigTable is used when we want to carry out fast sequential scanning of data in columnar format. For fast sequential scanning with low latency, our NoSQL tool of choice on the Google Cloud Platform is BigTable. BigTable is quite indistinguishable from HBase under the hood. Like HBase, it is a columnar database, which is good for sparse data. We will examine what exactly a columnar database is in a lot more detail in just a moment. Like Cloud Spanner, BigTable stores its physical representation as key-value pairs in sorted key order. This means that BigTable, like Cloud Spanner, is sensitive to hot spots: if reads and writes are not evenly distributed across the key space, performance can take a bad hit.
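Since rows are stored in sorted key order, strictly sequential keys such as timestamps funnel all writes to one node. A common mitigation is to salt the row key with a deterministic shard prefix. Here is a minimal sketch in plain Python; the `<shard>#<timestamp>` key format and the shard count are invented for illustration and are not a BigTable API:

```python
import hashlib

def salted_row_key(timestamp: str, num_shards: int = 4) -> str:
    """Prefix a sequential key with a deterministic shard number so that
    writes spread across several tablets instead of hammering one node.
    (Hypothetical key scheme, not part of any BigTable client library.)"""
    shard = int(hashlib.md5(timestamp.encode()).hexdigest(), 16) % num_shards
    return f"{shard:02d}#{timestamp}"

# Six consecutive timestamps no longer sort next to each other on disk.
keys = sorted(salted_row_key(f"2017-01-01T00:00:{s:02d}") for s in range(6))
```

The trade-off is that a range scan over a single time window must now fan out across every shard prefix, so salting is only worthwhile when the write hot spot is the real bottleneck.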
For most intents and purposes, we can assume that BigTable and HBase are synonymous. We'll have a little more to say on the relationship between BigTable and HBase, but in general, like any cloud tool, BigTable has a bunch of advantages. It is a managed version of HBase, and that's a much closer relationship than, say, the one between Hive and BigQuery; the underlying representations of data in Hive and BigQuery are quite different. The advantages of BigTable over self-managed HBase are exactly the ones you would expect from a cloud platform: scalability, a low administrative and operational burden, the ability to resize clusters without downtime, and the ability to support many more column families before performance drops. We will get to that when we discuss the columnar data format. Because the connection between BigTable and HBase is so strong, it makes sense for us to thoroughly understand the properties of HBase. To begin with, it's a columnar data store, which means that effectively the representation has just three columns.
Well, actually, four. It supports denormalized storage, which is quite different from an RDBMS. It focuses on the CRUD operations: create, read, update, and delete. That is a more basic set of data manipulation operations than most RDBMS offer. And lastly, transaction support in HBase is pretty much nonexistent: the only operations where ACID properties are guaranteed are row-level operations. So again, this is worth remembering: HBase is ACID only at the row level. ACID stands for atomicity, consistency, isolation, and durability. Let's go through these one by one and understand them in some detail. Let's start with the idea of a columnar data store. Say you wish to store data for a notification service on an e-commerce website. Notifications have properties like the ID of the person to whom they were sent, the type of notification (this could be an offer or a sale notification), and the content of the notification message. In a traditional relational database, we would store this data in the form of a table with four rows and a bunch of columns.
In our relation, each row corresponds to one tuple; this is the layout of any traditional RDBMS. The number of elements in each row corresponds to the length of the schema. Here, each row has four elements, and each of those elements has to correspond in type to the schema that's specified. To understand the significance of any one value, any one piece of data in a relational database, we must link it to both its corresponding row ID and its column. For example, in the row where the notification ID is three, the value "jill" is the person to whom the notification was sent. Now let's check out how this exact same set of data items would be represented in a columnar data store. Effectively, in a columnar data store, there would only be three columns, and these three columns map to the columns of our relational data as follows:
First up, there is an ID column; this is common between the columnar data store and the relational database representation. The second column in the columnar data store is a column identifier, which contains the values that correspond to the column names in the relational database. Effectively, what we've done is encode the columns from the RDBMS as field values in the columnar data store. And now, to complete the representation of any one row of data, we add a third column holding each of the cell values from the RDBMS tuple; notice how those cell values are associated with their column identifiers. A couple of points jump out and quickly grab our attention. Every row from the relational database now has multiple rows in the columnar store; in fact, it has one row for each column from the RDBMS. The other bit that jumps out is that the columnar data store is clearly not normalized. For instance, notice how the column identifiers for type and content appear repeatedly in the columnar format. That is not normalized storage, and in a traditional RDBMS it would be frowned upon.
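To make that mapping concrete, here is a toy sketch; the field names and values are made up, and, as noted later in the lecture, this is a schematic model rather than how BigTable lays out bytes internally. It flattens RDBMS-style rows into the three-column (row ID, column identifier, value) layout:

```python
# Hypothetical notification rows as they would appear in an RDBMS table.
relational_rows = [
    {"id": 1, "to": "jack", "type": "offer", "content": "20% off"},
    {"id": 2, "to": "jill", "type": "sale",  "content": "ends Friday"},
]

def to_columnar(rows):
    """Flatten RDBMS-style rows into (row id, column identifier, value)
    triples: the three-column layout of a columnar store."""
    triples = []
    for row in rows:
        for col, value in row.items():
            if col == "id" or value is None:  # null cells are simply absent
                continue
            triples.append((row["id"], col, value))
    return triples

triples = to_columnar(relational_rows)
```

Note how each relational row becomes three triples, and how the column identifiers `to`, `type`, and `content` repeat for every row ID, which is exactly the redundancy the transcript calls out.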
To make up for this, however, the columnar data store has a couple of powerful advantages. The first has to do with the ease with which it handles sparse data: if your data has a lot of null values, you're not going to end up wasting much space. The other has to do with the dynamic nature of attributes, that is, columns. Notice how, in a columnar data store, we can add new columns on the fly without changing the schema of our data store. If we wanted to add a column in an RDBMS, we would have to carry out an ALTER TABLE operation, which carries a significant penalty. Let's come back to the question we posed at the start of this video. There are actually two separate answers here: the first is why it's difficult to add columns on the fly in Cloud Spanner, and the second is why it's easy to add columns in BigTable. Let's talk about BigTable first; this one is easier to understand. BigTable is a columnar database, so if you decide to add columns to some tables in your data set, all you need to do is insert new rows into your database. You do not need to change the schema in any way.
This is why adding columns dynamically is pretty easy in BigTable. Let's now talk about why it's difficult to add columns in Cloud Spanner. For one, Cloud Spanner is a relational database, so each time you add columns, that changes the schema, and there are going to be a whole bunch of database writes. These writes will require transaction support, which will hurt performance. So that's one reason. Another, more fundamental reason why adding columns is particularly difficult in Cloud Spanner is the nature of the underlying storage. Remember, Cloud Spanner uses interleaving: a complex physical layout in which related data items are grouped together. That gives rise to a whole bunch of practical difficulties when you want to change the schema and add columns.
2. Columnar Store
Generations of computer science students have grown up learning about the importance of normalization in database design. Why, then, is it that distributed databases often compromise on normalization? What are the drawbacks of normalization in the distributed world? This is a question I would like you to ponder as you watch this video; we will come back to the answer at the end. The two advantages of columnar data stores like BigTable and HBase are quite significant, so let's make sure we really understand them.
Let's start out by understanding why columnar data stores are so much better at dealing with sparse data. Let's keep going with our discussions of notification data, because that's actually a good example of the kind of data where there are a bunch of missing values. Here, for instance, there might be notification types that have expiration dates.
These are offers that are going to expire at a certain point. Sale and offer notifications have expiration dates, but the other notification types do not. Also, it is entirely possible that order notifications have an "order status" field. This is a field that is specific to order notifications. If we wanted to accommodate all of these different types of notifications within a relational database table, we would effectively keep adding columns.
This would cause our table to get wider and wider, but these columns would only exist for a small subset of the total notification data, and so our relational database would fill up with more and more nulls, more and more empty values. It's also worth keeping in mind that columnar data stores like BigTable or HBase tend to operate on really large data sets, on the order of petabytes, perhaps several orders of magnitude larger than relational databases. In a relational database we can afford to ignore the space occupied by missing values, but we cannot when we are dealing with petabytes of data in a columnar data store. So the combination of an extremely large data set and very sparse rows, with a lot of empty values in each, becomes a real problem as the data set explodes in size.
This is where columnar stores come in handy because, as we can see, we simply do not store a row corresponding to a null value. For instance, notice that here we have a notification with an expiry date field; that is because it's a sale or an offer notification, and these are the only notifications that will have such rows in our columnar data store. Notifications that lack this attribute simply do not have rows corresponding to this field, and the result is that no space is wasted on empty cells. This example also demonstrates the other great advantage of columnar data stores: the ability to add new attributes, that is, new columns, dynamically, on the fly, as rows in our columnar data. If we wanted to add a new field or column to a relational database, we would have to use the ALTER TABLE command and then add more columns.
Those columns would have null values for most of the existing data. None of these issues arise when we add columns to a columnar data store in the form of new attributes. So the dynamic addition of attributes is yet another advantage of this type of data representation. One important aside: in this conversation about BigTable, as in the preceding conversation about Cloud Spanner, we have been dealing with a lot of schematic diagrams of how data is laid out.
It should be noted that these are not necessarily accurate descriptions of how data is stored internally, but they are schematically correct, in the sense that you can use them as a good guide to how columnar data stores work or how interleaving in Cloud Spanner works. But don't get hung up on them, and don't attempt to reverse engineer the actual physical storage of data in either of these technologies; what matters is the basic idea of a columnar data store and its benefits and drawbacks.
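The sparse-data advantage discussed above can be made concrete with a small counting sketch. The notification fields and values here are invented for illustration, matching the schematic spirit of the lecture's diagrams:

```python
# Hypothetical sparse notifications: only 'sale'/'offer' rows carry an
# expiry date, and only 'order' rows carry a status.
rows = [
    {"id": 1, "type": "offer", "expiry": "2017-06-01", "status": None},
    {"id": 2, "type": "order", "expiry": None, "status": "shipped"},
    {"id": 3, "type": "info",  "expiry": None, "status": None},
]

# An RDBMS table must reserve a cell for every column of every row
# (3 rows x 3 non-id columns = 9 cells, several of them NULL) ...
rdbms_cells = len(rows) * (len(rows[0]) - 1)

# ... whereas a columnar store only materialises the non-null values.
columnar_rows = sum(
    1 for r in rows for col, v in r.items() if col != "id" and v is not None
)
```

At three rows the waste is trivial, but the same ratio applied to petabytes of wide, sparse notification data is exactly the problem the transcript describes.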
Let's now move on and talk about denormalized storage. We've already discussed how storage in columnar data stores does not fit the traditional definitions of normalization. In a traditional RDBMS, an important objective is minimizing redundancy; that's what gave rise to the different normal forms, and in particular to third normal form, which is what most RDBMS designs shoot for. Let's understand this with an example. Say we wish to minimize redundancy when storing employee details, and this data also includes subordinate and reporting relationships as well as addresses. All of this is part of our data set. Let's see how we would design the tables in the RDBMS world.
We would have one employee details table. This would contain information specific to an employee, kept separate from subordinates or addresses, because one employee can have multiple subordinates and multiple addresses. So subordinate and address information would reside in separate tables; for instance, there would be an employee-subordinate table.
This would link to the employee details table via the ID column. In a similar manner, we would have an employee-address table, which once again links back to the employee table via the ID. In this way, we deal seamlessly with the situation where an employee has multiple subordinates and multiple addresses. Now let's focus a little on the ID column: this is the column that holds our data set together.
Why did we decide to keep all of the employee details in one table and separate out the subordinate and address data? Because if we had multiple subordinates per employee, we would have had to repeat all of the employee-specific information, such as name, function, grade, and so on. By having a separate employee-subordinate table that links entirely on the basis of that one ID column, we only need to repeat the ID.
We do not need to repeat any of the other data items for that employee. It also means that, no matter what table we're talking about, we refer to an employee using that ID column. In a sense, this ID column is the key to the data set. We have made our data granular by splitting it across multiple tables, and we have also eliminated redundancy.
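The three-table layout just described can be sketched with plain Python dicts standing in for tables; all names and IDs are invented for illustration. The point is that reassembling one employee requires three separate lookups, which in a distributed system can mean three network hops:

```python
# Toy normalised layout: one dict per table, linked only by employee ID.
employees = {101: {"name": "Anna", "grade": "L5"}}
subordinates = {101: [102, 103]}                      # id -> subordinate ids
addresses = {101: [{"city": "Oslo", "zip": "0150"}]}  # id -> address rows

def full_record(emp_id):
    """Join the three 'tables' back together: three lookups, and in a
    distributed database potentially three different nodes."""
    return {
        **employees[emp_id],
        "subordinates": subordinates.get(emp_id, []),
        "addresses": addresses.get(emp_id, []),
    }

record = full_record(101)
```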
But we do have a more complex data model, because now the same ID column is logically and semantically linked across these three tables. This is normalization, and this is the way traditional RDBMS do things. Let's come back to the question we posed at the start of the video; I think it's a really interesting one.
Normalization in traditional database design was largely driven by the need to save space, and that in turn was driven by the monolithic nature of database servers. You had one very big and powerful machine, and a whole bunch of data had to be crammed into it, so the bottleneck was the amount of data you could fit onto that machine. In a distributed database, all of a sudden, bandwidth is the bottleneck.
The number of network accesses that you are going to need to perform and the number of different nodes that you will need to access in order to read data become the really expensive operations. And now, all of a sudden, normalization isn't such a great idea.
Let's say you normalize data and end up storing related data items on distant, different nodes; even if you save a few bytes, having to access the network three times instead of once will give you terrible performance. That is why, in a distributed world, disk seeks are more expensive than storage. As a result, denormalized data forms that group together all of the information you require are becoming more popular.
3. Denormalised
Here's a question I'd like you to think about as you go through the contents of this video. It's a true or false question: BigTable supports equijoins, but it does not support constraints or indices. Equijoins are joins where the join condition involves an equality check.
As we saw in the case of Datastore, joins and inequality filters come with restrictions in these kinds of technologies. Is BigTable one of those technologies? Is the statement true or false? Normalization and the traditional normal forms have existed as standards in RDBMS and database theory for decades. The basic idea is to optimize the amount of storage.
As we've already seen, using the normalized forms allows us to save employee-specific details such as the name, grade, and so on just once; we do not have to repeat these for each subordinate or each address that the employee has. But the reality now is that we are working on a distributed file system, where storage is actually very cheap because we have a large number of generic machines, each with a lot of attached storage.
What is really costly in a distributed file system is making a lot of disk seeks against data that resides on different machines. For instance, if you wanted to get all the information about one employee, a normalized storage form would require us to look up three different tables, which might reside in three very different parts of the network, and that could impose a terrible performance penalty in a distributed system.
While this would have been perfectly acceptable on a monolithic database server from the 1990s or 2000s, columnar data stores eliminate the concept of normalization. They squish all of their data together so that all the data for one entity resides together. The immediate and obvious implication of this is that we have eliminated normalization. In fact, with our subordinate data here we have violated first normal form, because for every subordinate of a given employee we now have an array element in the row corresponding to that employee, and arrays are compound data types, which violate first normal form. We need to do something similar for address information, but address information is structured, and this requires us to use something even more complex than an array: a struct.
Each address consists of a city and a zip code, and for each employee we include an array of such structs containing the cities and zip codes of all of that employee's addresses. Notice that everything that has to do with a particular employee is logically grouped together and effectively indexed by the row ID (the employee ID, which is equal to one in this example). The great advantage of this representation is that to get all the information about a particular employee, we just need to carry out one disk seek. All of these data items will be logically and physically stored close to each other in a distributed system. This can lead to an incredible improvement in performance, particularly if you are smart about how you sort and store your data items. And that's exactly what BigTable and HBase do.
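The denormalized row described above can be sketched as a single nested record; the same invented employee data as before, now squished into one entry so that one key lookup returns everything:

```python
# The same employee, denormalised: subordinates as an array and addresses
# as an array of structs, all hanging off one row ID (values invented).
bigtable_like = {
    101: {
        "name": "Anna",
        "grade": "L5",
        "subordinates": [102, 103],                      # violates 1NF
        "addresses": [{"city": "Oslo", "zip": "0150"}],  # array of structs
    }
}

# One key lookup -- conceptually one disk seek -- returns the whole entity.
record = bigtable_like[101]
```

Compare this with the normalized sketch earlier: the data is redundant-looking and non-normalized, but it trades a few bytes of storage for a single seek instead of three.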
Next up on the agenda, we need to understand the operations that HBase and BigTable support and do not support. Basically, HBase and BigTable only support CRUD operations. CRUD stands for create, read, update, and delete. Now, this is a far smaller set of operations than traditional databases or SQL support, where we are used to complex operations across rows such as joins, group-by operations, or sorting operations such as order-by. If you take a minute and stare at these three bits of functionality, which are supported in SQL and RDBMS but not in HBase, what jumps out? What do these operations have in common? You guessed it: they all involve, in some capacity, some kind of comparison, sorting, or equality check across different rows in the same data set.
HBase is very row-centric; it basically scans data by row key, and it doesn't really understand operations that compare groups of rows with each other. That's why HBase and BigTable are both NoSQL technologies: they do not support SQL. The reason, of course, ties back to their underlying data representation, where a row is the basic unit of viewing the world. It turns out that HBase only allows a very limited set of operations: the CRUD operations we've already discussed. It's okay to create data sets, read data, update specific data items, and delete data items. As we shall see, most of these are indexed by a row key.
That's the only way HBase knows how to access data. So CRUD operations are supported by HBase; more complex joins, order-by, and aggregation operations are not. This is an important point to keep in mind, particularly on the Google Cloud Platform, where BigQuery, which offers a Hive-like interface on top of cloud storage, supports these operations and performs a lot better than Hive. So if you need to choose between BigQuery and BigTable, do remember these limitations of BigTable.
BigTable will throw up its hands and will not support any operations involving multiple tables. It does not support indexes on anything other than the row key, which is the ID column, and it does not support constraints. None of these facilities are available to you if you want to use BigTable. And all of these restrictions, which could seem arbitrary, make sense if you keep in mind that all data needs to be self-contained within one row; that is the basic underlying premise of columnar data stores like BigTable and HBase. Let's now move on to the next property of HBase that is important for us to understand: the fact that HBase only supports ACID at the row level. Recall that ACID stands for atomicity, consistency, isolation, and durability; this is the transaction support provided by a traditional RDBMS. Now, in HBase, updates to a single row are atomic.
So effectively, any operation you carry out that affects a particular row ID will be all or nothing: either all of the columns corresponding to that row will be affected, or none will be. However, this applies only to a single row. Updates to multiple rows are not atomic, even if your update is to the same column but on multiple rows. Again, this should come as no surprise once we've understood the underlying representation of data in HBase. The worldview of a columnar data store is restricted to groups of data with the same row ID. Once you cross the boundary of a row, all bets are off.
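Row-level atomicity can be modeled with a toy in-memory store; this is a conceptual sketch of the all-or-nothing semantics, not HBase or BigTable client code, and the copy-and-swap mechanism is simply one way to illustrate the guarantee:

```python
class RowAtomicStore:
    """Toy model of HBase/BigTable semantics: mutations to ONE row are
    all-or-nothing; there is no multi-row transaction."""

    def __init__(self):
        self.rows = {}

    def mutate_row(self, key, updates):
        # Build the new row off to the side, then swap it in at the end:
        # either every column update lands, or none does.
        new_row = dict(self.rows.get(key, {}))
        for col, value in updates.items():
            if value is None:
                raise ValueError("bad mutation")  # aborts the whole row update
            new_row[col] = value
        self.rows[key] = new_row                  # single atomic swap

store = RowAtomicStore()
store.mutate_row("user#1", {"cf:name": "jack", "cf:type": "offer"})
try:
    # This mutation fails partway through its column list ...
    store.mutate_row("user#1", {"cf:name": "jill", "cf:type": None})
except ValueError:
    pass
# ... and the row is left exactly as it was: no partial update.
```

Writing the same column across two different keys would be two independent `mutate_row` calls, which is precisely why multi-row updates carry no atomicity guarantee.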
So let's quickly summarize all of the differences between a traditional RDBMS and a columnar store like HBase. Remember that in an RDBMS, data is arranged in rows and columns, but in HBase, data is arranged in columns only; that's where the name "columnar data store" comes from. Traditional relational databases are SQL-compliant; in fact, they are, by definition, SQL databases. HBase is a prominent example of a NoSQL database; specifically, it is a key-value store. Traditional RDBMS and database design place a high value on normalization, because it minimizes redundancy and optimizes the amount of space taken up by data. However, columnar data stores like HBase do not care about normalization.
In fact, they intentionally denormalize data in order to make it easier and faster to access related data items in a distributed file system. HBase operates at much larger data set sizes than traditional RDBMS, and that has implications for transaction support and ACID compliance: HBase is only going to support ACID properties at the row level. Multi-row operations are not ACID-compliant. Let's come back to the question that we posed. The statement is false. BigTable is about as NoSQL as it gets: it does not support any operations across tables, and it does not support joins of any form, constraints, or indices. Again, BigTable is pretty hardcore NoSQL. Everything is only at the level of rows and column families.
Prepaway's Professional Data Engineer: Professional Data Engineer on Google Cloud Platform video training course for passing certification exams is the only solution you need.

Professional Data Engineer Premium Bundle
- Premium File 319 Questions & Answers. Last update: Feb 09, 2025
- Training Course 201 Video Lectures
- Study Guide 543 Pages