Wednesday, January 1, 2025

AWS Data Engineer Associate Certification Notes

 

Type of Data:

Structured: RDBMS tables

Unstructured: documents, videos, audio, email, images

Semi-structured: XML, JSON, log files


Properties of Data

Volume - how much data (GB, TB, PB)?

Velocity - how fast data is generated, collected, and processed. Real-time or batch?

Variety - type of data: structured, unstructured, or semi-structured.





Data formats:

Text: csv (structured), xml/json (semi-structured)

Binary: Avro (stores data and schema together). Good for big data and for serialization when transporting data between systems. Supported by the Apache Kafka, Flink, Spark, and Hadoop ecosystems.

Parquet: Columnar storage format with efficient compression. Supported by Apache Hive, Impala, Redshift Spectrum, and Spark.


Data Lineage: Visual representation that traces the flow and transformation of data through its lifecycle, from source to destination.


Data Sampling Techniques: random sampling, stratified sampling (divide data into homogeneous groups, then sample from each group), systematic sampling, cluster sampling, etc. (see the sketch below)
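
A minimal pandas sketch contrasting random and stratified sampling; the DataFrame and its "segment" column are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    "segment": ["gold", "gold", "silver", "silver", "silver", "bronze"] * 100,
    "amount": range(600),
})

# Random sampling: every row has the same chance of selection.
random_sample = df.sample(frac=0.1, random_state=42)

# Stratified sampling: take 10% from each homogeneous group (stratum).
stratified_sample = (
    df.groupby("segment", group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=42))
)

print(stratified_sample["segment"].value_counts())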


Data Skew: imbalance/unequal distribution of data across partitions - the "celebrity problem". To mitigate skew: 1/ Adaptive partitioning (dynamically adjust partitions) 2/ Salting (introduce a random factor into the partition key - see the sketch below) 3/ Repartitioning 4/ Sample the data to check its distribution.
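
A hedged PySpark sketch of salting; events, users, and user_id are hypothetical table/column names, and 8 salt buckets is an arbitrary choice:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
events = spark.table("events")   # large table, skewed on user_id
users = spark.table("users")     # smaller lookup table

SALT_BUCKETS = 8

# Add a random salt to the skewed side so a hot key spreads over 8 partitions.
events_salted = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Explode the small side so every salt value has a matching row.
users_salted = users.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = events_salted.join(users_salted, on=["user_id", "salt"]).drop("salt")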


Data Validation: Missing data, null counts, percentage of populated fields. Consistent data. Data integrity check (relation and trustworthy)


RegEx: ~ is the regex-match operator (for regex checks in SQL).

^ starts with. $ ends with. [a-z] character range. [a-z]{3} exactly 3 of those characters. abc|xyz matches either string. \d digit. \w word character. \t tab. \s whitespace.
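
A quick sanity check of these tokens with Python's re module (the sample strings are arbitrary):

import re

assert re.search(r"^data", "data engineer")        # ^ starts with
assert re.search(r"lake$", "s3 data lake")         # $ ends with
assert re.fullmatch(r"[a-z]{3}", "abc")            # exactly 3 lowercase letters
assert re.search(r"abc|xyz", "xyz123")             # either string
assert re.search(r"\d", "shard-42")                # a digit
assert re.search(r"\w+\s\w+", "hello world")       # word chars separated by whitespace
assert re.search(r"^(San|El).*", "San Francisco")  # starts with San or El (cf. the Athena example later)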


DynamoDB:

1 WCU = 1 write per second of an item up to 1 KB.

1 RCU = 1 strongly consistent read per second of an item up to 4 KB.

1 RCU = 2 eventually consistent reads per second of items up to 4 KB.
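
A small worked example of these rules (item size and request rates are made-up numbers):

import math

item_size_kb = 6
writes_per_sec = 100
reads_per_sec = 80

# WCU: 1 KB units, one unit per write.
wcu = writes_per_sec * math.ceil(item_size_kb / 1)           # 100 * 6 = 600 WCU

# RCU: 4 KB units; strongly consistent = 1 read per RCU, eventual = 2 reads per RCU.
rcu_strong = reads_per_sec * math.ceil(item_size_kb / 4)     # 80 * 2 = 160 RCU
rcu_eventual = math.ceil(rcu_strong / 2)                     # 80 RCU

print(wcu, rcu_strong, rcu_eventual)   # 600 160 80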


If writes are throttled on a GSI, the main table gets throttled as well.


Amazon MemoryDB for Redis: Redis-compatible, durable in-memory DB. Multi-AZ. Scales from tens of GB to hundreds of TB.

Amazon Keyspaces: managed Apache Cassandra, replicated across multiple AZs (3 copies). Serverless. Uses Cassandra Query Language (CQL). On-demand and provisioned capacity modes.

Amazon Neptune: graph DB, e.g., for social networks.

Amazon Timestream: time-series DB that scales up and down automatically; can store and analyze trillions of events per day. IoT devices, Kinesis Data Streams, MSK, etc. can send data to Timestream.


Amazon Redshift: OLAP, SQL compatible queries. Does result caching


A Redshift cluster has 1 leader node and 1-to-many compute nodes. Each compute node has dedicated CPU, memory, and attached disk storage, and is divided into 1-to-many slices (each with its own share of CPU and memory); the number of slices depends on the node type/cluster size selected. Each slice processes the chunk of data fed to it.

Redshift Spectrum: lets you query exabytes of data in S3 without loading it into Redshift. Supports Avro, CSV, Parquet, etc.; gzip and Snappy compression are supported.


Redshift uses Massively Parallel Processing (MPP), Columnar Data Storage, Column Compression.

Use the COPY command to load data into Redshift; it applies compression automatically.

Redshift replicates your data within the cluster, backs it up to S3 asynchronously, and takes automated snapshots. Multi-AZ is available for RA3 clusters. A cluster of at least 2 nodes is recommended.


Redshift Distribution Styles: 

Auto (default; Redshift picks Even, Key, or All based on data size),

Even (rows distributed across slices in round-robin fashion) ...good if no joins are expected between tables,

Key (rows distributed by the values of one key column) ...matching key values are stored on a single slice of a compute node (see the DDL sketch below),

All (a full copy of the table on every node) ...uses extra storage since it's fully redundant; suited to small, frequently joined tables.
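
A hedged sketch of setting these styles explicitly in DDL, issued through the Redshift Data API; the cluster, database, and table names are placeholders:

import boto3

rsd = boto3.client("redshift-data")

fact_ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- matching customer_id values land on the same slice
SORTKEY (sale_id);
"""

dim_ddl = "CREATE TABLE date_dim (date_key DATE, is_holiday BOOLEAN) DISTSTYLE ALL;"

resp = rsd.batch_execute_statement(
    ClusterIdentifier="demo-cluster",   # hypothetical cluster/database
    Database="dev",
    DbUser="awsuser",
    Sqls=[fact_ddl, dim_ddl],           # KEY for the fact table, ALL for the small dimension
)
print(resp["Id"])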


COPY Command: Allows you to copy data from S3, EMR, DynamoDB or remote hosts.

An S3 COPY can use a JSON manifest file to list the exact files to load.

copy customer from 's3://amzn-s3-demo-bucket/cust.manifest' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;      

manifest file example:

{
  "entries": [
    {"url":"s3://amzn-s3-demo-bucket1/custdata.1","mandatory":true},
    {"url":"s3://amzn-s3-demo-bucket1/custdata.2","mandatory":true},
    {"url":"s3://amzn-s3-demo-bucket2/custdata.1","mandatory":false}
  ]
}

manifest for ORC/Parquet files (the meta.content_length attribute is required for these formats):

{
  "entries": [
    {
      "url": "s3://amzn-s3-demo-bucket1/orc/2013-10-04-custdata",
      "mandatory": true,
      "meta": { "content_length": 99 }
    },
    {
      "url": "s3://amzn-s3-demo-bucket2/orc/2013-10-05-custdata",
      "mandatory": true,
      "meta": { "content_length": 99 }
    }
  ]
}


UNLOAD command: exports data from a Redshift table (or query result) to S3.
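
A hedged example of UNLOAD to Parquet, issued through the Redshift Data API; the bucket, IAM role, cluster, and table names are placeholders:

import boto3

rsd = boto3.client("redshift-data")

unload_sql = """
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2025-01-01''')
TO 's3://amzn-s3-demo-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET
PARTITION BY (sale_date);
"""

rsd.execute_statement(
    ClusterIdentifier="demo-cluster",   # hypothetical cluster/database
    Database="dev",
    DbUser="awsuser",
    Sql=unload_sql,
)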

Enhanced VPC Routing to make sure data transfer happens through VPC and not internet.

Auto Copy from S3…monitors the data in S3 and copies to Redshift

Amazon Aurora Zero ETL - Get Aurora data replicated to Redshift.

For table-to-table copies within Redshift, use INSERT INTO ... SELECT or CREATE TABLE AS.

Create a snapshot copy grant to copy a KMS-encrypted Redshift snapshot to another region.

Use dblink to copy/sync data between Redshift and PostgreSQL (CREATE EXTENSION dblink).

Redshift WLM (Workload Management): queue management so fast and slow queries don't compete; set up different queues.

Concurrency Scaling to add burst capacity to heavy read. Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.

Configuring query queues: priority, query groups, user groups, and concurrency scaling mode. Up to 8 queues; default concurrency level of 5 per queue, with a maximum total concurrency level of 50 (50 queries running at once).

For every queue, we can define allocated memory & timeout .

Short Query Acceleration (SQA): runs short queries in a dedicated space - an alternative to defining a WLM queue just for short queries.

VACUUM: cleans up tables, recovering space from deleted rows. VACUUM FULL (recovers space and re-sorts). VACUUM REINDEX (re-indexes interleaved sort keys and recovers disk space).


Redshift resizing: 

Elastic: add or remove nodes of the same type. The node count can only be doubled or halved with an elastic resize.

Classic: allows changing the node type. Resizing takes a long time - hours or days.

Snapshot, restore, then resize: create a copy of the cluster from a snapshot and shift traffic to it, avoiding downtime during a classic resize.


RA3 Nodes: Decoupled Storage and Compute capacity.

Redshift data lake export: Allows Parquet format UNLOAD to S3.

Share Live Data using Cross Region Data Sharing.

Redshift ML --> uses SageMaker Autopilot under the hood. Train a model from query data (conceptually model.fit()), then call the generated SQL function for predictions (conceptually model.predict()).


To use an HSM with Redshift, configure a client and server certificate connection between Redshift and the HSM.

Redshift Serverless: capacity is measured in Redshift Processing Units (RPUs).


Redshift Materialized Views: store the actual result of the query as pre-computed data.

Can serve stale data if the underlying table data changes. Use auto refresh or manually refresh the materialized view.


Redshift Lambda UDF: Create external function . And use the function in query

CREATE EXTERNAL FUNCTION exfunc_sum(INT,INT)

RETURNS INT

VOLATILE

LAMBDA 'lambda_sum'

IAM_ROLE 'arn:aws:iam::123456789012:role/Redshift-Exfunc-Test';


Query rows are batched per segment --> Redshift invokes the Lambda function for each batch --> results are returned to the query.


Redshift federated queries: let Redshift query live data in Aurora, RDS, and other external DBs directly. Set up with CREATE EXTERNAL SCHEMA.


Application Migration Service (MGN): previously CloudEndure Migration.


AWS DataSync: moves NFS, SMB, and HDFS data from on-premises to the cloud. Agent-based.

For AWS-to-AWS transfers, no agent is needed.

File permissions and metadata are preserved.


SnowFamily:

Snowcone: comes with the agent pre-installed. 8 TB HDD or 14 TB SSD.

Snowball Edge: 80 TB - 210 TB.


EMR runs on EC2 instances under the hood.

Core nodes hold data (HDFS); task nodes don't hold data.


A Transform declaration in the CloudFormation YAML file indicates that it's a SAM template.


sam package is short form for aws cloudformation package


ECS runs ECS Agent on EC2.


AWS Glue: Serverless Discovery and definition of tables. Metadata extraction. Uses Apache Spark internally.


Glue Crawler and Data Catalog: a crawler scans data in S3 (and other stores) and creates catalog tables/metadata. Once the metadata exists, you can query the data with Athena, EMR, Redshift Spectrum, or QuickSight (sketch below).
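
A minimal boto3 sketch of creating and starting a crawler over an S3 prefix; the crawler name, IAM role, database, and bucket are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",   # catalog database that will hold the tables
    Targets={"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/raw/sales/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
)

glue.start_crawler(Name="sales-crawler")
# Once it finishes, the table is queryable from Athena, Redshift Spectrum, EMR, or QuickSight.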



You can import a Hive metastore into Glue Catalog.

Glue has Spark Clusters running underneath it. Supports Python and Scala both. You can provide your own Spark or PySpark code.

To pick up new partitions or table schema changes, either re-run the crawler or have the Glue ETL job update the catalog itself via enableUpdateCatalog / updateBehavior (see the sketch below).
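
A hedged Glue job sketch that writes a DynamicFrame back through the catalog with enableUpdateCatalog, so new partitions and schema changes are registered without re-running the crawler; database and table names are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table from the Data Catalog (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_sales"
)

# ... transformations on dyf would go here ...

# Write back and let the job update the catalog with new partitions / schema changes.
glue_context.write_dynamic_frame_from_catalog(
    frame=dyf,
    database="sales_db",
    table_name="curated_sales",
    additional_options={
        "enableUpdateCatalog": True,
        "updateBehavior": "UPDATE_IN_DATABASE",
        "partitionKeys": ["year", "month"],
    },
)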

Glue ETL supports serverless streamed ETL with Kinesis and Kafka.

Glue DynamicFrame allows Semi-structured data as well. Spark DataFrame is more suited for Structured data.

Glue Data Quality: Validates the data via data quality rules.

Glue DataBrew: visual data preparation tool for pre-processing data. Transformations are saved as reusable recipes. An alternative to writing Glue ETL code.

Glue Workflow: Multi-job workflow . Triggers within workflow starts job or crawler. Orchestrates Glue jobs.

AWS Lake Formation: 

• Sets up a secure data lake on S3. Sources can be anything Glue supports. Query with Athena, EMR, and Redshift Spectrum.

• Cross Account Lake Formation needs Resource Access Manager sharing and the recipient registered as admin with necessary IAM permissions. Doesn't allow cross account query access via Redshift spectrum.

• Governed Tables: ACID with Data Lake. Automatic Compaction (Combine small data files into larger, more efficient files to optimize table performance.)

• Row and Cell level access control both for Governed and S3 Tables.

• Data filters in Lake formation allows row, cell and column level access. Specific row and specific columns leads to cell level security.


Amazon Athena:

• Serverless Interactive Query Service for Data in S3.

• Uses Presto DB.

• Supports csv, json, parquet (columnar), avro (splittable), orc(columnar). Supports all compression.

• Data can be structured, semi-structured or un-structured 

• Quicksight integrates with Athena as a dataset via Glue Table

• Athena Workgroup allows you to limit query usage, data scanned (data limits) and permissions for running Athena queries.

• Columnar queries are cheaper and faster.

• A small number of large files performs better than a large number of small files.

• If partitions are added later, run MSCK REPAIR TABLE to register them (see the sketch after this list).

• Athena ACID Transactions are backed by Iceberg under the hood. Table type ="ICEBERG" create table. Allows time travel.

• If ACID (Iceberg) table queries are getting slower, run OPTIMIZE table_name REWRITE DATA USING BIN_PACK to compact small files.

• Athena for Apache Spark uses Firecracker under the hood to run Spark-based notebooks and calculations.
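
A hedged boto3 sketch of running MSCK REPAIR TABLE through Athena after new partition folders land in S3; the workgroup, database, table, and output location are placeholders:

import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE raw_sales",
    QueryExecutionContext={"Database": "sales_db"},
    WorkGroup="analytics",   # workgroups can enforce per-query data-scanned limits
    ResultConfiguration={"OutputLocation": "s3://amzn-s3-demo-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])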


Apache Spark:

• Distributed data processing for Big Data. In-memory caching and query optimization.

• Supports Java, Scala , Python and R.

• Spark Context (driver) --> Cluster Manager --> Spark Executors

• Spark SQL is a distributed query engine, up to 100x faster than MapReduce.

• GraphX: distributed graph processing / ETL.


Athena CTAS: CREATE TABLE <name> WITH (format, write_compression, external_location) AS SELECT * FROM <table>;


Elastic Map Reduce(EMR):

• EMR Cluster is a collection of EC2 instances running Master Node/Leader Node (manages the cluster), Core Node(Hosts HDFS Data) and Task nodes(Runs tasks) ..use spot instances for Task nodes.

• Transient clusters - shut down once the job is complete.

• Storage in S3 for input and output

• HDFS on EMR is on local storage of EC2. Ephemeral storage

• EMRFS uses S3 as File Storage.

• MapReduce --> mappers (extract data as key/value pairs) and reducers (aggregate and join the results back).

• YARN --> resource management layer between processing frameworks such as MapReduce and HDFS; manages the cluster's resources.

• EMR allows Centralized RBAC via Apache Ranger


Kinesis Data Streams:

• Define the shards of Kinesis Data Streams. 

• Input from Producers at a rate of 1MB/sec or 1000 msg/sec per shard.

• Consumption at 2 MB/sec per shard shared across all consumers, or 2 MB/sec per shard per consumer with enhanced fan-out.

• OnDemand Mode allows 4MB/sec or 4000 records/sec

• KPL records must be decoded using KCL

• For more batching, increase RecordMaxBufferedTime to aggregate more records.

• The plain AWS SDK (PutRecord/PutRecords) sends data immediately, without the KPL's buffering delay - an advantage over the KPL when you need the latest data with minimal latency (sketch after this list).

• Enhanced fan-out improves both throughput and latency.

• Shard splitting adds throughput: an existing shard is split into 2 new shards and the parent shard is closed and eventually deleted.

• Out-of-order records can happen because of resharding. Best practice is to finish reading the parent shard before reading the new child shards; the KCL has this logic built in.

• Due to network timeouts, an ACK from Kinesis Data Streams may not reach the KPL, which can lead to duplicate entries.
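
A minimal boto3 producer sketch for the notes above: a plain SDK PutRecord call sends a record immediately (no KPL buffering); the stream name and event fields are made up:

import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u-123", "action": "click", "ts": "2025-01-01T00:00:00Z"}

kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],   # records with the same key go to the same shard
)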


Kinesis Data Firehose:

• KDF has destinations as S3, Redshift (COPY from S3), OpenSearch and Splunk

• Firehose has buffer limits for size and time. Post that limits, the buffer is flushed.

• KDF is near real time whereas Kinesis Streams is Real Time.

• 500 errors indicate server-side delivery failures; an error rate above 1% is considered high. Implement retries.


Managed Service for Apache Flink:

• Source could be Kinesis Data Streams or Managed Streaming for Apache Kafka Service (MSK)

• Apache Flink does the processing

• Sink for Flink is S3, Kinesis Data Streams or Firehose.

• Detect anomalies using the RANDOM_CUT_FOREST function (via Kinesis Data Analytics SQL).


Managed Service for Apache Kafka: (MSK)

• MSK creates, manages Kafka nodes and Zookeeper nodes 

• Apache Kafka can be configured to accept messages up to 10 MB. The default is 1 MB. Kinesis Data Streams has a hard limit of 1 MB.

• Producer sends data to Kafka Topic . Consumers pull data from Kafka.

• Mutual TLS for AuthN and ACLs for AuthZ.

• MSK Connect Connector allows Plugin to connect to various destinations.

• MSK Serverless is PayAsYouGo. Define your partitions and topics.

• Kinesis Data Streams is Streams with Shard and MSK is Topics with Partitions.


Amazon OpenSearch:

• Forked from Elasticsearch and Kibana. A document storage and retrieval system.

• Search engine.

• Kibana is now called OpenSearch Dashboards.

• Full-Text Search , Log Analytics, App Monitoring, Security Analytics.

• OpenSearch based on Apache lucene search engine.

• OpenSearch Storage types: Hot has EBS Volumes. Ultra-Warm uses S3 + Caching. Cold Storage uses S3 as well.


Amazon Quicksight is available inside a VPC using ENI. (Enterprise version only)


SQS: Message size limit of 256KB


AWS Step Functions:

• Parallel runs separate branches of execution

• Map : Runs a step for each item in dataset. Used in Data processing.


MWAA: (Amazon Managed Workflow for Apache Airflow)

• Workflows are written in Python as DAGs (Directed Acyclic Graphs). Batch-oriented workflows.

• The DAG Python code is uploaded to S3; MWAA picks it up and executes the DAG (minimal example below).
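
A minimal, hypothetical Airflow DAG of the kind MWAA runs; this file would go in the environment's S3 DAGs folder (the DAG id and task are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    print("pull data from the source and load it to S3")


with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)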


AWS Data Pipeline:

• Allows data from sources such as RDS and DynamoDB to be moved into S3. No longer supported.


CloudWatch Logs:


• Subscription filter allows you to send the log streams to Kinesis Data Firehose, Lambda, Kinesis Data Stream and OpenSearch.

• Logs can be sent from multiple accounts to a single central destination account (via Kinesis Data Firehose, Kinesis Data Streams, or OpenSearch) using subscription filters (see the sketch below).
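
A hedged boto3 sketch of a subscription filter that forwards a log group to a Kinesis Data Stream; for cross-account delivery the destinationArn would instead point to a CloudWatch Logs destination in the central account. All names and ARNs are placeholders:

import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/aws/lambda/orders-api",
    filterName="to-central-kinesis",
    filterPattern="",   # empty pattern forwards every log event
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/central-logs",
    roleArn="arn:aws:iam::123456789012:role/CWLtoKinesisRole",  # lets CloudWatch Logs put records
)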


CloudTrail:

• CloudTrail Insights does anomaly detection on API error rates and call rates.


CDK:

• Uses constructs and supports multiple languages such as TypeScript, JavaScript, Python, Java, and .NET.



• Prefer MWAA over Step Functions if extensive customization is needed in terms of custom operators and scripts.

• WLM is more for resource allocation rather than faster query executions.

• Delta with Amazon EMR provides ACID transaction capabilities, scalable metadata handling, and time travel (bi-temporal querying).

•  EFS can be mounted to Lambda, allowing the function to access larger files without being limited by the Lambda package size limit.

• AWS DMS introduced in 2022 a Schema Conversion feature that automates the conversion of the Microsoft SQL Server database schema to be compatible with Aurora MySQL.

•  Spark UI provides comprehensive information about the Spark job's execution, which is vital for troubleshooting and optimizing performance.

• AWS Data Exchange is designed for precisely this purpose. It allows users to easily find, subscribe to, and use third-party data sets directly in the AWS cloud.

• EMR is a cloud-native big data platform that supports a broad range of processing frameworks, including Apache Pig, Presto, and Apache Flink.

• Athena CTAS syntax: CREATE TABLE new_table WITH (format = 'PARQUET', write_compression = 'SNAPPY') AS SELECT * FROM old_table;

• SELECT * FROM orders WHERE city ~ '^(San|El).*'   ...^ means "starts with". The '.*' following the group allows any sequence of characters afterward, ensuring that all city names starting with these prefixes are selected.

• Structuring data in Apache Iceberg or Delta Lake format on S3 and using AWS Glue for applying CDC updates is the most efficient method.

• When workgroup-wide settings are enforced in Athena, they apply to all queries running in that workgroup. This means that these settings will override any differing client-side settings for query execution.

• For data warehouses intended to support complex analytical queries over large datasets, a star schema is often the most effective design. It features a central fact table that contains transactional data, and dimension tables that describe attributes related to the facts.


• AWS Glue Workflows:This is the best and most profitable option. AWS Glue is designed specifically for ETL on AWS and integrates directly with data sources such as Microsoft SQL Server through connectors. This allows for easier configuration and avoids the need for additional development.

• Glue Workflow only orchestrate crawlers and glue jobs

• Redshift Data sharing lets you share live data, without having to create a copy or move it. Database administrators and data engineers can use data sharing to provide secure, read-only access to data for analytics purposes, while maintaining control over the data. 

• If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

• By using federated queries in Amazon Redshift, you can query and analyze data across operational databases, data warehouses, and data lakes. With the Federated Query feature, you can integrate queries from Amazon Redshift on live data in external databases with queries across your Amazon Redshift and Amazon S3 environments. Federated queries can work with external databases in Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible Edition.

• Amazon Redshift can automatically refresh materialized views with up-to-date data from its base tables when materialized views are created with or altered to have the autorefresh option. Amazon Redshift autorefreshes materialized views as soon as possible after base tables changes.

• Step Functions' Map state lets you define a single execution path that processes each item in a collection in parallel.

• Glue's FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.

• Cross-Account Delivery: Kinesis Data Streams in the security account ensures the logs reside in the designated security-focused environment.

• CloudWatch Logs Integration: Granting CloudWatch Logs permissions to put records into the Kinesis Data Streams.

• Creating an AWS Glue partition index and enabling partition filtering can significantly improve query performance when dealing with large datasets with many partitions. The partition index allows Athena to quickly identify the relevant partitions for a query, reducing the time spent scanning unnecessary data. Partition filtering further optimizes the query by only scanning the partitions that match the filter conditions.

• Athena partition projection based on the S3 bucket prefix is another effective technique to improve query performance. By leveraging the bucket prefix structure, Athena can prune partitions that are not relevant to the query, reducing the amount of data that needs to be scanned and processed. This approach is particularly useful when the data is organized in a hierarchical structure within the S3 bucket.

• STL_ALERT_EVENT_LOG records any alerts/notifications related to queries or user-defined performance thresholds. This would capture optimizer alerts about potential performance issues.

• STL_PLAN_INFO provides detailed info on execution plans. The optimizer statistics and warnings provide insight into problematic query plans.

• STL_USAGE_CONTROL limits user activity but does not log anomalies.

• STL_QUERY_METRICS has execution stats but no plan diagnostics.

• With Amazon Redshift query editor v2, you can automate SQL queries to run on a schedule.