Praveen

Wednesday, January 1, 2025

AWS Data Engineer Associate Certification Notes

Type of Data:

Structured, Un-Structured and Semi-Structured

RDBMS Documents/Videos/Audio/email/image xml/json/log files

Properties of Data

Volume - volume of data GB, TB , PB?

Velocity - velocity of data being generated, collected and processed. Real-time or batch?

Variety - Type of data. Struct/UnStruct or Semi-Struct.

Data formats:

Text: csv (structured), xml/json (semi-structured)

Binary: Avro (stores data and schema together). Use for BigData. Good for serialization for transport between systems. Apache Kafka, Flink, Spark and Hadoop ecosystems supports this.

Parquet: Columnar data store. Efficient compression. Apache Hive, Impala, Redshift spectrum, Spark

Data Lineage: Visual representation that traces the flow and transformation of data through its lifecycle, from source to destination.

Data Sampling Techniques: Random Sampling, Stratified Sampling ( divide data into homogenous groups and then create subgroups) and Systemic, cluster and etc.

Data Skew Mechanism: Imbalance of data/unequal distribution. Celebrity problem. To avoid skew, 1/ Adaptive partitioning (dynamically adjust)

2/ Salting (Introduce random factor to partition) 3/ Repartitioning 4/Sample the data to check distribution

Data Validation: Missing data, null counts, percentage of populated fields. Consistent data. Data integrity check (relation and trustworthy)

RegEx: ~ for regex chech.

^ starts with. $ ends with. [a-z] range . [a-z]{3} length of 3. abc|xyz ..checks for both strings. \d digit. \w word. \t Tab. \s whitespaces

DynamoDB:

1 WCU is 1 Write of data 1KB in size in 1 sec

1 RCU is 1 Strongly consistent Read of data 4KB in size in 1 sec

1 RCU is 2 Eventual consistent Read of data 4KB in size in 1 sec

If the writes are throttled on GSI then the main table gets throttled as well.

Amazon Memory DB: Redis compatible DB. Multi AZ. Scales from 10s of GB to 100s of TB

Amazon KeySpaces: Managed Cassandra across 3 MultiAZ. Serverless. Uses Cassandra Query Language. On-Demand and Provisioned capacity

Amazon Neptune: Graph DB. Social N/W.

Amazon TimeStreamDB: Automatic scales up and down. Timestream storage for analyzing trillion of events a day. IoT , Kinesis Data Streams, MSK and etc can send data to TimeStreamDB.

Amazon Redshift: OLAP, SQL compatible queries. Does result caching

1 Redshift Cluster has 1 leader Nodes and 1-many Compute Node. Each Compute Node has dedicated CPU, Memory and attached disk storage. Also, has 1-Many slices (separate cpu and memory)..and number of slices depends on Cluster Size selected. Slices process the data chunk fed to it.

Redshift Spectrum: Allows to query exabytes of data from S3 without loading data into Redshift. Supports AVRO, CSV, Parquet and etc. gz and snappy compression supported.

Redshift uses Massively Parallel Processing (MPP), Columnar Data Storage, Column Compression.

Use COPY command to copy data to redshift. Automatic does the compression.

Redshift replicates your data within cluster, backs it up on S3 Async, Automated Snapshots. Multi-AZ available for RA3 Cluster. 2 nodes cluster is recommended.

Redshift Distribution Styles:

Auto (Distribution based on data size. Default value but it's either of Even, Key or All),

Even (Row based distribution across slices in Round Robin fashion), ….this is good if no join expected between tables.

Key (Distribution based on a one Key Column) …matching key gets stored on a single slice under a compute node.

All (Table copied to every node) …double the storage as it's redundant

COPY Command: Allows you to copy data from S3, EMR, DynamoDB or remote hosts.

S3 COPY requires a Manifest file (json)

copy customer from 's3://amzn-s3-demo-bucket/cust.manifest' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

manifest file example:

{

"entries": [

{"url":"s3://amzn-s3-demo-bucket1/custdata.1","mandatory":true},

{"url":"s3://amzn-s3-demo-bucket1/custdata.2","mandatory":true},

{"url":"s3://amzn-s3-demo-bucket2/custdata.1","mandatory":false}

]

}

{

"entries":[

{

"url":"s3://amzn-s3-demo-bucket1/orc/2013-10-04-custdata",

"mandatory":true,

"meta":{ …meta attribute is required for Parquet and ORC format

"content_length":99

}

{

"url":"s3://amzn-s3-demo-bucket2/orc/2013-10-05-custdata",

"mandatory":true,

"meta":{

"content_length":99

}

]

}

UNLOAD Command : to get data from Redshift table exported to S3.

Enhanced VPC Routing to make sure data transfer happens through VPC and not internet.

Auto Copy from S3…monitors the data in S3 and copies to Redshift

Amazon Aurora Zero ETL - Get Aurora data replicated to Redshift.

From Redshift table to table copy, use INSERT INTO ..SELECT or create table as

Create Copy Grant to copy Redshift Snapshot to another region which is encrypted with KMS Key

Use DBLink to copy/sync data between Redshift and PostGreSQL. Create extension dblink

Redshift WLM (WorkLoad management)..queue management for fast, slow queries. Setup different queues.

Concurrency Scaling to add burst capacity to heavy read. Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.

Configuring query queues : Priority, query groups, user group or concurrency scaling mode. Upto 8 queues. 5 default. Concurrency level of 50 ..that is 50 queries running at once.

For every queue, we can define allocated memory & timeout .

Short Query Acceleration (SQA): Works for Short Queries..alternative to WLM queue for Short Queue.

VACCUM: Cleans up tables. Recover space from deleted rows. VACCUM FULL (recovers and re-sort). REINDEX (REIndexes and recovers disk space)

Redshift resizing:

Elastic: Add or remove nodes of same type. Can either double or halve using elastic resize.

Classic: Allows to change node type. Takes a long time for resize. Takes hours or days.

Snapshot, restore or resize. Allows to shift traffic in Classic resize without downtime.

RA3 Nodes: Decoupled Storage and Compute capacity.

Redshift data lake export: Allows Parquet format UNLOAD to S3.

Share Live Data using Cross Region Data Sharing.

Redshift ML --> Uses AutoPilot of SageMaker. Model.fit() --Train model model.predict() --Prediction

HSM with Client and Server setup to work with Redshift.

Redshift Serverless: Capacity measured as Redshift Processing Units

Redshift Materialized Views: Stores the actual response of the query. It's prec-computed data.

Could have stale data if underlying table data changed. Use Auto refresh or Manual Refresh Materialized view

Redshift Lambda UDF: Create external function . And use the function in query

CREATE EXTERNAL FUNCTION exfunc_sum(INT,INT)

RETURNS INT

VOLATILE

LAMBDA 'lambda_sum'

IAM_ROLE 'arn:aws:iam::123456789012:role/Redshift-Exfunc-Test';

Query in segment--> lambda use resource -->

Redshift federated queries: Allows Redshift to connect to Aurora or other DBs to get query response. CREATE EXTERNAL SCHEMA

Application Migration Service (MGN): Previously, Cloud Endure migration

AWS DataSync: NFS, SMB, HDFS on-premise to Cloud. Agent based.

For AWS -AWS transfer, no agent needed.

File permissions and metadata preserved.

SnowFamily:

SnowCone: Comes with Agent installed. Good for 8TB HDD to 14TB SSD.

Snowball Edge: 80TB -210 TB

EC2 is behind EMR

Compute Nodes have Data. Task nodes don't have data.

Transform in YAML file indicates that it's a SAM template.

sam package is short form for aws cloudformation package

ECS runs ECS Agent on EC2.

AWS Glue: Serverless Discovery and definition of tables. Metadata extraction. Uses Apache Spark internally.

Glue Crawler or Data Catalog: Scans the data from S3 and etc and creates a catalog/metadata. With metadata identified, you can query the data using Athena, EMR, Redshift or Quicksight

You can import a Hive metastore into Glue Catalog.

Glue has Spark Clusters running underneath it. Supports Python and Scala both. You can provide your own Spark or PySpark code.

For adding new partitions or updating table schema, you'll have to re-run the crawler or enableUpdateCatalog/ updateBehavior.

Glue ETL supports serverless streamed ETL with Kinesis and Kafka.

Glue DynamicFrame allows Semi-structured data as well. Spark DataFrame is more suited for Structured data.

Glue Data Quality: Validates the data via data quality rules.

Glue DataBrew: Visual Data Prep tool. Pre-processing the data. Create recipes as runbook. Alternative to use Glue ETL.

Glue Workflow: Multi-job workflow . Triggers within workflow starts job or crawler. Orchestrates Glue jobs.

AWS Lake Formation:

• Setup a secure data lake. Allows you to setup S3 Data Lake. Source could be all sources of Glue. Query with Athena, EMR and Redshift Spectrum.

• Cross Account Lake Formation needs Resource Access Manager sharing and the recipient registered as admin with necessary IAM permissions. Doesn't allow cross account query access via Redshift spectrum.

• Governed Tables: ACID with Data Lake. Automatic Compaction (Combine small data files into larger, more efficient files to optimize table performance.)

• Row and Cell level access control both for Governed and S3 Tables.

• Data filters in Lake formation allows row, cell and column level access. Specific row and specific columns leads to cell level security.

Amazon Athena:

• Serverless Interactive Query Service for Data in S3.

• Uses Presto DB.

• Supports csv, json, parquet (columnar), avro (splittable), orc(columnar). Supports all compression.

• Data can be structured, semi-structured or un-structured

• Quicksight integrates with Athena as a dataset via Glue Table

• Athena Workgroup allows you to limit query usage, data scanned (data limits) and permissions for running Athena queries.

• Columnar queries are cheaper and faster.

• Small number of large files perform better than large number of small files.

• If adding partition later on , use MSCK REPAIR TABLE

• Athena ACID Transactions are backed by Iceberg under the hood. Table type ="ICEBERG" create table. Allows time travel.

• If ACID transactions are getting slower, then OPTIMIZE Table REWRITE DATA USING BIN PACK

• Uses FireCracker under the hood to allow Apache Spark based notebook and run calculation.

Apache Spark:

• Distributed data processing for Big Data. In-memory caching and query optimization.

• Supports Java, Scala , Python and R.

• Spark Context --> Spark Clusters/Manager --> Spark Executors

• Spark SQL is distributed querying engine. 100x faster than MapReduce

• Graph X Distributed Graph processing ETL

CREATE TABLE <> WITH (format, write_compression, external_location) AS SELECT * FROM

Elastic Map Reduce(EMR):

• EMR Cluster is a collection of EC2 instances running Master Node/Leader Node (manages the cluster), Core Node(Hosts HDFS Data) and Task nodes(Runs tasks) ..use spot instances for Task nodes.

• Transient clusters - Shuts down one the job is complete.

• Storage in S3 for input and output

• HDFS on EMR is on local storage of EC2. Ephemeral storage

• EMRFS uses S3 as File Storage.

• MapReduce --> mappers (key/value pairs extracts data) and reducers(joins the results back).

• YARN --> Abstraction layer between MapReduce and HDFS. Manages the clusters.

• EMR allows Centralized RBAC via Apache Ranger

Kinesis Data Streams:

• Define the shards of Kinesis Data Streams.

• Input from Producers at a rate of 1MB/sec or 1000 msg/sec per shard.

• Consumption from Shard at 2MB/sec per shard for all consumers or 2MB/sec for each consumers with enhanced fan out.

• OnDemand Mode allows 4MB/sec or 4000 records/sec

• KPL records must be decoded using KCL

• For more batching, increase RecordMaxBufferedTime to aggregate more records.

• AWS SDK allows you to send only latest data. This is an advantage on top of using KPL.

• Enhanced Fan out improves throughput as well as latency.

• Shard splitting allows more throughput with existing shard split into 2 new shards and the existing one getting deleted.

• Out of Order records can happen because of resharding. Best practice is to read from Parent shard after resharding activity. KCL has this inbuilt.

• Due to N/W timeouts, the ACK might not reach from Kinesis Data Streams to KPL which may lead to duplicate entries.

Kinesis Data Firehose:

• KDF has destinations as S3, Redshift (COPY from S3), OpenSearch and Splunk

• Firehose has buffer limits for size and time. Post that limits, the buffer is flushed.

• KDF is near real time whereas Kinesis Streams is Real Time.

• 500 errors indicates high error rates > 1%. Implement retries.

Managed Service for Apache Flink:

• Source could be Kinesis Data Streams or Managed Streaming for Apache Kafka Service (MSK)

• Apache Flink does the processing

• Sink for Flink is S3, Kinesis Data Streams or Firehose.

• Detect Anomaly using RANDOM_CUT_FOREST via Kinesis Data Analytics

Managed Service for Apache Kafka: (MSK)

• MSK creates, manages Kafka nodes and Zookeeper nodes

• Apache Kafka can be configured to send messages upto 10 MB. Default is 1MB. Kinesis Data Streams has a hard limit of 1MB.

• Producer sends data to Kafka Topic . Consumers pull data from Kafka.

• Mutual TLS for AuthN and ACLs for AuthZ.

• MSK Connect Connector allows Plugin to connect to various destinations.

• MSK Serverless is PayAsYouGo. Define your partitions and topics.

• Kinesis Data Streams is Streams with Shard and MSK is Topics with Partitions.

Amazon OpenSearch:

• Forked from Elasticsearch and Kibana. Documents storage and retrieval systems.

• Search Engine

• Formerly Kibana now Dashboards

• Full-Text Search , Log Analytics, App Monitoring, Security Analytics.

• OpenSearch based on Apache lucene search engine.

• OpenSearch Storage types: Hot has EBS Volumes. Ultra-Warm uses S3 + Caching. Cold Storage uses S3 as well.

Amazon Quicksight is available inside a VPC using ENI. (Enterprise version only)

SQS: Message size limit of 256KB

AWS Step Functions:

• Parallel runs separate branches of execution

• Map : Runs a step for each item in dataset. Used in Data processing.

MWAA: (Amazon Managed Workflow for Apache Airflow)

• Written in Python. Creates DAG (Directed Acyclic Graph). Batch oriented workflow.

• DAG Python Code is uploaded on S3. MWAA picks it up and executes the DAG

AWS Data Pipeline:

• Allows data from different sources RDS, DynamoDB to be put in S3. No longer supported.

CloudWatch Logs:

• Subscription filter allows you to send the log streams to Kinesis Data Firehose, Lambda, Kinesis Data Stream and OpenSearch.

• Allows Logs to be sent from multiple account to central/single Destination Account via (KDF , OpenSearch & Kinesis Data Stream) using Subscription filter.

CloudTrail:

• CloudTrail event insights does anomaly detection for API error rate and call rate.

CDK:

• Uses constructs and allows you to use different languages such as Typescript, Python, Java, .NET, JS

• MWAA over Step Functions if extensive customizations is needed in terms of custom operators and scripts.

• WLM is more for resource allocation rather than faster query executions.

• Delta with Amazon EMR provides ACID transaction capabilities, scalable metadata handling, and time travel (bi-temporal querying).

• EFS can be mounted to Lambda, allowing the function to access larger files without being limited by the Lambda package size limit.

• AWS DMS introduced in 2022 a Schema Conversion feature that automates the conversion of the Microsoft SQL Server database schema to be compatible with Aurora MySQL.

• Spark UI provides comprehensive information about the Spark job's execution, which is vital for troubleshooting and optimizing performance.

• AWS Data Exchange is designed for precisely this purpose. It allows users to easily find, subscribe to, and use third-party data sets directly in the AWS cloud.

• EMR is a cloud-native big data platform that supports a broad range of processing frameworks, including Apache Pig, Presto, and Apache Flink.

• Athena CTAS Syntax: CREATE TABLE new_table WITH ( format = 'Parquet', write_compression = 'SNAPPY')

• AS SELECT * FROM old_table;

• SELECT * FROM orders WHERE city ~ '^(San|El).*' ….^ ..means starts with. The '.*' following the group allows any sequence of characters afterward, ensuring that all city names starting with these prefixes are selected.

• Structuring data in Apache Iceberg or Delta Lake format on S3 and using AWS Glue for applying CDC updates is the most efficient method.

• When workgroup-wide settings are enforced in Athena, they apply to all queries running in that workgroup. This means that these settings will override any differing client-side settings for query execution.

• For data warehouses intended to support complex analytical queries over large datasets, a star schema is often the most effective design. It features a central fact table that contains transactional data, and dimension tables that describe attributes related to the facts.

• AWS Glue Workflows:This is the best and most profitable option. AWS Glue is designed specifically for ETL on AWS and integrates directly with data sources such as Microsoft SQL Server through connectors. This allows for easier configuration and avoids the need for additional development.

• Glue Workflow only orchestrate crawlers and glue jobs

• Redshift Data sharing lets you share live data, without having to create a copy or move it. Database administrators and data engineers can use data sharing to provide secure, read-only access to data for analytics purposes, while maintaining control over the data.

• If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

• By using federated queries in Amazon Redshift, you can query and analyze data across operational databases, data warehouses, and data lakes. With the Federated Query feature, you can integrate queries from Amazon Redshift on live data in external databases with queries across your Amazon Redshift and Amazon S3 environments. Federated queries can work with external databases in Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible Edition.

• Amazon Redshift can automatically refresh materialized views with up-to-date data from its base tables when materialized views are created with or altered to have the autorefresh option. Amazon Redshift autorefreshes materialized views as soon as possible after base tables changes.

• State Function's Map state allows you to define a single execution path for processing a collection of data items in parallel.

• Glue's FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.

• Cross-Account Delivery: Kinesis Data Streams in the security account ensures the logs reside in the designated security-focused environment.

• CloudWatch Logs Integration: Granting CloudWatch Logs permissions to put records into the Kinesis Data Streams.

• Creating an AWS Glue partition index and enabling partition filtering can significantly improve query performance when dealing with large datasets with many partitions. The partition index allows Athena to quickly identify the relevant partitions for a query, reducing the time spent scanning unnecessary data. Partition filtering further optimizes the query by only scanning the partitions that match the filter conditions.

• Athena partition projection based on the S3 bucket prefix is another effective technique to improve query performance. By leveraging the bucket prefix structure, Athena can prune partitions that are not relevant to the query, reducing the amount of data that needs to be scanned and processed. This approach is particularly useful when the data is organized in a hierarchical structure within the S3 bucket.

• STL_ALERT_EVENT_LOG records any alerts/notifications related to queries or user-defined performance thresholds. This would capture optimizer alerts about potential performance issues.

• STL_PLAN_INFO provides detailed info on execution plans. The optimizer statistics and warnings provide insight into problematic query plans.

• STL_USAGE_CONTROL limits user activity but does not log anomalies.

• STL_QUERY_METRICS has execution stats but no plan diagnostics.

• With Amazon Redshift query editor v2, you can automate SQL queries to run on a schedule.

Monday, December 9, 2024

Use Jenkins CI/CD Pipeline to Deploy Kubernetes Pod/Deploy/Service on AWS EKS

In this blog, I'll try to list down the steps needed to use Jenkins (Deployed on an EC2 Machine) to utilize the AWS EKS (Managed K8S Service) and use Kubectl commands to create Service/Deployment and Pods on EKS.

Steps:

1. Deploy Jenkins on Amazon EC2. I have used Amazon Linux AMI.

2. You can use files from my public Github repo: https://github.com/pkumar8789/sample-k8s-jenkins.git

Once your Jenkins is Up & Running:

Install required Plugins: In this case, Install Kubernetes CLI & Kubernetes Plugin

Configure the Jenkins to connect to EKS Server:

Click on Clouds --> New Cloud

Scroll down and Go to Credentials

Use the option of Secret File.

*** This secret file should be the KubeConfig file post you have set the Kube Context. I have provided a sample KubeConfig file in my Github repo.

Kubeconfig file is generally created using the command below

aws eks --region $(terraform output -raw region) update-kubeconfig --name $(terraform output -raw cluster_name)

https://developer.hashicorp.com/terraform/tutorials/kubernetes/eks

Create a New Item in jenkins and use the below Jenkins pipeline format to connect to EKS and execute kubectl commands:

Note: We are having one additional step to do AWS Configure to set the user profile.

Sunday, August 16, 2020

AWS Solution Architect Professional Notes

Hi All, While preparing for AWS Solution Architect Professional course, I created notes for me and always used to revise the same before the mock test. It did refresh me on the topics and key points.

Thought of sharing it with others. Enjoy :)

Updated notes as per Re-certification in 2023:

Timeseries DB stores time series data

Traffic mirroring allows to track message content as well

Amazon Data Life Cycle Manager automates EBS snapshots backups

S3 Batch replication --> Copies S3 existing data to another bucket

S3 Replication controller --> initiates copies of new Objects in S3 to desired location

IAM Access Analyzer --> policy recommendations, warning, findings

IAM Credential report --> credentials, access token and etc

AWS Config Conformance packs --> collection of config rules and remediation steps

Patch manager --> approve, reject patches to keep security up to date

AWS Backup: List of services below:

AWS Elastic Disaster Recovery (AWS DRS) minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications using affordable storage, minimal compute, and point-in-time recovery.

AWS Migration HUB: provides recommendation for EC2 instance selection, migration strategy and modernization.

Snowcone HDD has 8 TB of usable storage while Snowcone SSD has 14 TB of usable storage.

AWS DataSync comes pre-installed on AWS Snowcone to enable simple, fast, secure, and cost-effective online data import or export between storage on Snowcone devices at edge locations and Amazon S3, Amazon EFS, Amazon FSx for Windows File Server, or Amazon FSx for Lustre file systems.

Snowball device --> Moves 80 -100 TB of data which gets shipped to S3 later.

Snowball Edge > 80TB. Comes with EC2.

Snowmobil migrate upto 100 PB of data.

AWS Transfer Family allows to SFTP/FTP data to S3 or EFS

AWS Application Migration Services

DataStores:

IOPS: How much read/write you want to per sec. Small data fast times.

Throughput: How large data you want to handle at a go. Large data

ACID:

Atomic: All or nothing.

Consistent: Data should be valid. Not corrupted.

Isolated: Write locks. One transaction mustn't mess with another.

Durable: Persistent.

BASE:

Basic Availability: Available though not latest info. Might be stale data.

Soft State: Not all data stores are updated at the same time.

Eventual Consistent:

S3:

Max single object size: 5TB. One single Put is 5GB. Mutlipart upload over 100 MB.

If S3 Bucket Policy allows/deny a principal then access is granted/denied irrespective of IAM permissions. Resource policy takes precedence over identity based.

If Bucket policy has neither allow or deny then IAM policy is checked.

IAM role assumed access key are rotated in the background. Cross account recommendation is to use IAM roles and not bucket policy.

Glacier:

Archival storage. Comes with faster retrieval option as well.

Used by Storage G/W Tape Library for backup.

Abort a Glacier Vault under 24 hours or it's assigned to vault post that

EBS:

Snapshot lifecycle policy for volume and instance. Removes old snapshots as well.

EFS:

Can be mounted across AZs. Based on NFS File share. Use DataSync to sync files from OnPrem to AWS to S3 or EFS as it's more secure.

FSx:

File Share Service Windows compatible. Non NFS. FSx for NetApp ONTAP, FSx for OpenZFS, FSx for Windows(SMB) and FSx for Lustre

FSx for Lustre is used for HPC workload. Has SSD for fast access and also S3 to backup less frequent objects.

Storage G/W:

File G/W: NFS and SMB. Allows OnPrem or EC2 to use S3 via Mount Points

Volume G/W: Async replication of OnPrem on AWS S3. Backup (iSCSI)

Volume G/W Cached mode: Frequent access objects cached on OnPrem and all primary data on S3 (iSCSI)

Tape G/W: iSCSI tape library

RDS:

MySQL replication is supported with InnoDB or XtraDB on Maria

StandBy instance have Sync replication whereas Read replicas have Async replication.

Aurora:

MultiAZ by design. AWS Managed Auto Scaling. Upto 15 Read replicas.

Global DB can have upto 5 secondary regions. Underlying storage blocks are replicated.

Serverless V2 allows to scale up memory, cpu & n/w capacity.

DynamoDB:

MultiAZ, key value store, cross region replication. GSI and local index

DocumentDB:

MongoDB Compatible, noSQL DB, MultiAZ. Designed for JSON Data

Redshift:

Petabyte scale data warehouse DB. PostgreSQL compatible BI Integration. Parallel processing and columnar data storage.

Redshift spectrum allows querying data directly from S3. Better than Athena if you have multiple joins

Data lake is heterogenous data stored in common place and then allowing querying the same. For eg. Dumping data in S3 and using redshift spectrum to query and visualize the same using Analytics tool.

Glue:

Glue performs ETL. Serverless offering.

Glue Crawler --> crawls data from different sources --> creates Glue Data Catalog --> which can be consumed by EMR/Redshift/Athena or can be fed to Glue ETL Job --> to feed it to S3/Lake Formation/Redshift and CW

Glue Data Quality --> recommendations and pre-defined rules for data quality

Athena --> Petabyte scale analysis --> queries data from S3 --> Can also run apache spark code on athena. Based on Presto.

Neptune --> Fully managed Graph database. Ideal for social networking or product recommendation engine. Supports OpenGraph API for Gremlin and SPARQL

Amazon Quantum Ledger DB: Ledger DB.

Amazon Managed Blockchain: Supports open source Fabric and Ethereum frameworks

ElasticSearch/OpenSearch: Search engine and analytics service. Has fuzzy logic and part of ELK(ElasticSearch(Search/Storage) , LogStash (Intake) & Kibana(Analytics/View))

TCP is request and response(ephemeral ports)

UDP is just one way. Used for streaming data.

Upto 5000 transit G/W attachment is possible with a single TGW.

Placement Group:

Clustered: Placed in Same AZ

Spread: Spans Multiple Azs. Max 7 instance per group per AZ

Partition: grouped into partitions and spread across racks

Direct Connect:

OnPremise --> Cx Gateway (BGP routing) --> Direct Connect- --> Private VIF --> Virtual Private G/W --> VPC

OnPremise --> Cx Gateway (BGP routing) --> Direct Connect- --> Direct Connect G/W --> Transit G/W --> VPC

Direct Connect G/W can only connect to 3 VPC

Site to Site VPN:

Fastest way. IPSec Tunnel. OnPremise --> Cx Gateway --> IPSec Tunnel --> VPN Connection --> Virtual Private G/W --> VPC

Global Accelerator: Provides edge based Static IP to load balancer. Improves latency. Multi region failover

Route 53: DNS Service port is 53

Control Tower: Landing Zone, guardrails(SCP + Config rules) & Baseline( CFN Stacks). Enforce and manages governance rules in multi account setup

A landing zone is a well-architected, multi-account AWS environment that is a starting point from which you can deploy workloads and applications. It provides a baseline to get started with multi-account architecture, identity and access management, governance, data security, network design, and logging.

IAM Identity Center: Uses IAM role, AD Connector for AD and also has federated identity

Resource Access Manager is enabled at Organization level. Allows sharing of n/w and resources. Only share resources within same region.

RAM --> create resource share. Add managed policies for those resources.

CloudHSM: FIPS 140-2 Level 3 certified. Current CloudHSM is pay as u use. Single tenant.

KMS is FIPS 140-2 Level 2 certified. Few cases of level 3

Rolling Update: Update the Auto Scaling to read a different version of launch template. And terminate old instances.

Swap environment URL of Elastic Beanstalk.

Clone stack in AWS OpsWork and update DNS

Code commit is equivalent to Git

Code pipeline : Orchestration layer

Code build: compiles and packages

Code Deploy: Deploy packages to compute . AWS and on-premise

Cloud9 : IDE ..similar to Visual Studio

CodeGuru: Does code review and provide code recommendations. Connects to github, code commit, github enterprise server and bitbucket.

CodeStar: Deploys the whole project. Integrated with IDE, Code commit, pipeline and etc

CodeArtifact: looks like nexus repo

IoT Core: MQTT protocol connects to IoT devices.

IoT Analytics: Streams real time data from IoT devices. TimeSeries data.

IoT Greengrass: Provides local run of containers or lambda for inference on IoT devices.

IoT Sitewise: Analyze data locally as well and also upload to cloud.

Cannot sell Convertible RIs on Marketplace.

Global Accelerator ENIs that are created only allow traffic to and from AWS Global Accelerator service. They do not allow any type of broad access to the internet.

DHCP Options: You can control the DNS servers, domain names, or Network Time Protocol (NTP) servers used by the devices in your VPC.

Amazon WorkSpaces Web is a low-cost, fully managed, Linux-based service designed to facilitate secure browser access to internal websites and software-as-a-service (SaaS) applications from existing web browsers.

AWS Budgets include the types of cost budget, usage budget, reservation budget and savings plan budget.

Using Simple Workflow Service was Amazons recommended method for human- involved workflows. However, Step Functions is now the preferred choice, as this requires less programmatic work to build workflows.

We can easily increase the size of an EBS from the console or the CLI (using modify-volume)

For RDS, Read Replicas require backups for managing read replica logs and thus you cannot set the retention period to 0. You must first remove the read replicas and then you can disable backups.

The Amazon S3 File Gateway enables you to store and retrieve objects in Amazon Simple Storage Service (S3) using file protocols such as Network File System (NFS) and Server Message Block (SMB). Objects written through S3 File Gateway can be directly accessed in S3.

AWS recommends using AWS Directory Service for Microsoft Active Directory if you want to establish a trust relationship with on-prem directories that have user bases more than 5,000.

CloudFormation has an optional Conditions section that contains statements to determine whether or not entities should be created or configured. Then you can use the "Fn::If" function to check the condition and return different values.

AppStream is a way to deploy an application on a virtual desktop and allow anyone with a browser to use the application.

Source platform Target platform Migration approach

VMware vSphere 6.0+ VMware Cloud on AWS 6.5+ VMware HCX

VMware vSphere AWS native VM Import

Existing VM vSphere can be migrated to VMWare AWS using vMotion or VMWare Hybrid Cloud Extension (HCX)

Application Migration Service or Cloud Endure Migration. Server Migration Service.

https://docs.aws.amazon.com/whitepapers/latest/overview-aws-cloud-data-migration-services/cloud-data-migration-challenges.html

Migration Strategies:

Re-Host: Lift & Shift. Move On-Prem MySQL to EC2 on AWS.

RePlatform: Lift & ReShape. Migrate onprem MySQL to RDS MYSQL

RePurchase: Drop & Shop. Migrate onprem CRM to Salesforce.com

ReArchitect: ReDesign app. Serverless

Retire: Get rid of app

Retain: Do nothing. Continue

Migration HUB: For tracking and dashboard migration activities. Migration HUB uses Appln Discovery Service

Application Discovery Service: Agent based discovery of workload. Encrypted data transfer. This is Just Discovery!

Application Migration Service(MGN) : Does the actual migration of onprem Linux and Win servers to Cloud. MGN Agent is installed on VMs. Sets up staging server with test instances and then does a cutover. Replicates source servers on AWS.

Note: MGN is also a good choice for region to region migration.

Every new block read from a freshly restored EBS volume must first be downloaded from S3. This is because EBS snapshots are saved in S3. Remember that EBS snapshots are incremental in nature. Every time a new snapshot is taken, only the data that changed is written to that particular snapshot.

Redshift is optimized for analytical queries rather than high-volume data ingestion. It would be better to first load the data into S3 and then perform data transformations into Redshift.

Unfortunately it is still not possible to store Application Load Balancer access logs in an S3 bucket which is encrypted using SSE-KMS

Compute Optimizer: Can run on single account for whole organization. Runs ML to give recommendations on rightsizing(over and under provisioned)

Compute Optimizer works with EC2, EBS, Lambda and ECS Fargate.

Kinesis records can hold upto 1 MB of data. Retention period of default 24 hrs max 365 days. One shard is 1000 writes request per sec. Have more partitions to have more shard

Kinesis Data Streams: Streaming data ingestion

Kinesis Firehose: Persist data to different destination S3, Splunk, Elasticsearch, Redshift

Kinesis Data Analytics: Does the analytics on the go.

DynamoDB capacity: 1000 WCU or 3000 RCUs. Partitions decided by WCU/RCU calculations or Size (10GB for 1 partition).

IAM identity center is used for AuthN SSO via Browser mode only.

Cross Account Access with resource based policy doesn't need Principal to give up his access in base account.

Public VIF with Direct Connect allows to access AWS resources.

Use trusted-access to enable a service as Trusted service to be shared via RAM enables-sharing-with-aws-organization

Kinesis Producer Library is used to send data to Amazon Kinesis Data Streams.

Amazon RDS doesn't support certain features in Oracle such as MultiTenant DB, Real Application Clusters(RAC), unified Auditing, Database vault and etc. We need to use EC2 in such cases.

Reserved concurrency prevents lambda to use unreserved pool of concurrency limit available.

AWS Cloud Adoption Readiness Tool (CART) develops efficient and effective plans for cloud adoption

Use DMS service even for migration DB to EC2 on AWS.

Cloudfront Signed URLs are for single file. Cloudfront signed cookies is for multiple files.

Elastic Beanstalk deployment mode support is for all-at-once, rolling, rolling with additional batch and immutable. Immutable keeps existing stack as-is.

Aurora Auto Scaling is for Read Replicas.

DynamoDB Reserved Capacity units purchase to lower the cost.

For Centralized rule-based filtering with Network Firewall , you need Transit Gateway to act as N/W hub.

AWS AppSync allows to connect to Data sources such as DynamoDB, AWS Lambda with GraphQL APIs

Security HUB integrates with Macie, guard duty, config, inspector, firewall manager. Also integrates with 3rd party security solutions over marketplace. It generates recommendations as well.

N/W Firewall is Intrusion Detection and Prevention service

Firewall Manager allows firewall rules setup across organization. Standardize security tools such as WAF, SG, N/W Firewall across accounts.

Amazon WorkSpaces provides the complete VDI setup. AppStream just provides services/resources via browser..used for Application hosting.

Amazon worklink proxies internal applications on internet. Amazon AEA App.

Amazon comprehend does sentiment analysis of text input. Social media

Amazon forecast. Forecast data based on timeseries.

Amazon Lex for Chat/Conversational

Amazon Personalize builds recommendation engine based on behavioral and geo data

Amazon polly: text to speech

Amazon transcribe: translation of video ..subtitles.

AWS config rules can be used to check if tagging is done or not.

Starting with Few Important components/services:

EBS is persistent storage ideal for database, file system.

EBS is made fault tolerant by creating snapshot. To have snapshot consistent, stop the instance or flush memory

Backup strategy

Retention period of snapshot.

Snapshots are stored in S3.

Snapshot needs to be copied in other region and then volume can be created out of it.

EBS volumes are AZ specific.

Note: DR strategy for all DBs are around snapshot creation and creating DB in another region.

Resilient: handle exceptions, graceful handling

Modular: High Cohesion (keeping similar kind of entities together + low coupling)

AWS Step Functions for Orchestration

Use SQS or messaging/aync functions to decouple services

ELB and ASG: Regional services. If EC2s in different region, use R53.

Mulitple AZs to be used for fault tolerant.

DR Approaches

-Backup & Restore (Slowest + low cost)

-Pilot Light: only few critical components are on cloud

-Warm Standby: Smaller env with all processes

-Multi Site (fastest + highest cost)

With a S3 Stored Storage G/W Snapshot, we can launch an EBS volumes or another storage g/w on an EC2

Storage G/W can also provide TCP data for EC2 on AWS

S3: Object storage (photos, videos, files). AMI, snapshots are stored in S3. By default, multiple AZs, cross regions.

RDS: Multi AZ , read replicas. Automatic backups. Transactional logs are stored in RDS with logs 5 mins back. RDS DB Snapshot (manual backup)..no transactional logs

Multi AZ has synchronous Slave for a Master for Multi AZ RDS.

Consolidated Billing: (AWS Organizations)

Turn off sharing of RI instances

Master account is called Payer account.

AWS budget uses cost explorer.

Redshift doesn't work with spot instances.

AWS Organizations:

Creates a Root account which would have OUs or individual account under it. Root is a logical entity.

OU is an container of OU and accounts. OU can only have single parent.

An account can be member of only one OU.

Only 1 Master account which is the Payer account. Single payer account. Logging, control.

All other accounts are member account.

Service Control Policies

AWS Organizations is eventually consistent to replicate settings to all regions.

SCP has no effect on the Master/primary account. Apply SCP at OU level.

AWS organizations allows us to have common LDAP services, shared services.

AWS Organization creates service linked role for Master account with member account.

RTO: Recovery Time Objective- Time taken to restore service after crash.

File G/W: Files are stored in S3.

Storage Gateway: is Block storage. You take snapshot which is copied to S3. Create ec2 from that snapshot. iSCSI interface is placed on on-prem and interface interacts with AWS Storage Gateway service in AWS. Storage G/W is for backup…store.

Volume G/W:

Cached Volume G/W would cache frequently access data on-prem and rest of the stuff on AWS.

Stored volumes: on-prem has main set of data which gets backed-up on AWS asynch

Glacier: Lowest cost storage. Retrieval times needs to be considered.

Snowball: If data takes more than 6-7 days to transfer/copy , use snowball.

VM Import/Export to S3 is free of cost by AWS.

CloudFormation:

Works as Infra as Code. Template creates a Stack.

Change set is created based on delta based on template update.

Resources must always exist in Cloudformation template.

Naming conventions for "Type": AWS::Name of the service: :Instance

Conditions is how you control the template.

Intrinsic functions can be used in Resources, metadata, outputs and update policy attributes.

GetAtt, Ref, FindInMap, GetAZs, ImportValue are few intrinsic values.

CreationPolicy: injects dependency on other resource creation. waitCondition: allows template to wait/delay for creation on policy.

Delete Stack Template would delete all the components created.

Deletion Policy attribute: For DB, snapshot or deletes.

Nested Stacks: reuse common templates. For eg. ALB

Elastic Beanstalk:

Least control on Infra. OpsWork sits in middle . CloudFormation highest control on Infra

Beanstalk created EC2 doesn't have EC2 volumes backedup (ethereal storage). So, do not create RDS with beanstalk

Beanstalk focuses on Application. An application version points to S3 which has the deployable code. (war, ear, jar, zip)

Environment (runs one application version)

Beanstalk runs an Agent Host Manager on EC2 for application monitoring, log files rotation and publishing logs to S3. Only available with Beanstalk.

Packer open source tool use to create AMI.

Single container/ multiple container docker image.

Beanstalk creates S3 bucket with beanstalk-region-accountid

Beanstalk allows custom web server deployment.

AWS Ops Works: Deploying and monitor instances, ALB, DB and application. You CANNOT change region of the stack once created.

Chef : Configs are universally applied.

Cookbook : Contains the configs and instructions called recipes.

Recipe: written in Ruby is set of instructions abt resources and their order.

Stack needs region and operating system (win or linux). Can have multiple layers.

Layers work on same set of instructions. Each layer has its own recipes.

Ops Work Lifecycle: Setup, Deploy, Configure, UnDeploy and Shutdown

Instances: OpsWork installs the agent on instances.

Instance type: 24/7, Time based & load-based.

When communication between OpsWork and OpsWork agent on instance breaks --> Auto healing starts

OpsWork doesn't support Application Load Balancer..only classic.

OpsWork supports CloudTrail logs, event logs and chef logs.

Only 1 CLB per layer in Stack.

AWS Config:

Helps in audit, maintenance and compliance

Overall view of resource.

Configuration item is created whenever a change is recorded on the resource.

History is collection of items. Stored in S3 and SNS.

AWS Service Catalog: creates Portfolio which uses cloudformation

Product is application. Catalog is collection of products. Products are delivered as Portfolios.

Service catalog has constraints which restricts cost and components.

Launch, notification and template constraints.

Cloudwatch:

Metrics are Data points created regionally.

Namespace is container for cloudwatch metrics.

Alarms: OK, ALARM & INSUFFICIENT DATA

Period ( in seconds), evaluation period ( no. of datapoints/ per period), datapoints to alarm: how many datapoints to raise alarm.

Alarms can trigger EC2 actions, ASG or SNS actions..NO LAMBDA or SQS

Empty Log stream retention period is 2 months only.

Log retention from 1 day to 10 years. Logs are stored indefinite.

Cloudwatch logs insight needs JSON based events.

Encryption and metric filters are applied at Log Group level.

Unified cloudwatch agent: works with windows as well. Faster than old one.

Cloudwatch logs --> kinesis, kinesis streams or lambda (real time)

Cross account logging possible

Aws events can be shared with other aws accounts.

Synthetics: Canaries scripts tries to mimic Customer actions. Checks API latency and endpoints.

ServiceLens: integrated with Xray to provide end to end view of application.

Systems Manager:

Needs ssm agent to be deployed and running on host.

Actions: run command, session manager, patch mgr, automation, state mgr

Maintenance window

SSM resource groups are collection of resources in a region

State manager to run scripts on a recurring basis, patch updates, software updates

Amazon QuickSight for visualization

Resource DataSync syncs data from multiple accounts.

Symmetric key better than asymmetric keys

CSR Request --> X509 Types --> private key --> certificate chain

CSR should have CN (FQDN), Country and etc

PEM format (Privacy Encoded mail)

SSL or Session Keys are generated for the session only

AWS VPC:

5 IPs of every subnet is reserved. First 4 and last.

1st - base n/w

2nd- vpc

3rd - router

4th - future use

5th - last ip

VPN setup over internet ipsec based is fastest to achieve connection with on-prem. Performance could be slow. VPN has option of static and dynamic routing.

Direct connect only supports BGP Dynamic routing.

10 customer G/W can connect to 1 VGW (Soft limit) through ipsec tunnel

BGP is Dynamic Routing Mechanism

Autonomous System Number (ASN) ( Eg. Customer G/W and AWS G/W) used by BGP

LAG (Link Aggregator Groups) joins the links together as one. The links must be of same bandwidth.

BGP Communities for routing preferences.

Route table needs to point to VGW. There's no target with direct connect or vpn.

BGP prefers Direct Connect over vpn site to site

Static route is preferred over BGP (Dynamic)

Longest subnet mask is preferred ( /24 over /16)

From customer g/w to vpc, we need to configure local preference or route for BGP

Direct Connect G/W is a global resource and is not linked to a single region. This cloud be attached to multiple regions VGW.

You can ONLY attach one VGW with one Direct Connect G/W.

Private or public VIF could be max 50 per direct connect.

Cannot have more than 4 Dedicated connections per LAG. 10 LAGs per Region.

200 Direct connect gateways per account.

Inter region peering is allowed.

Transit gateway are regional resource. Direct connect g/w is Global.

Transit G/W attaches a VPC or VPN. Doesn't connect to Direct connect

Enhanced EC2 networking -IOPS - SR I/OV- Low latency- HVM AMI

Spread placement group provides Distinct underlying hardware. Max 7 instances each AZ.

NAT G/W is per AZ.

EC2 creates ENI primary by default. ETH0 types

Interface endpoint is an ENI entry.

Load Balancer:

NLB supports TCP and TLS

handles million of request

access log, cross zone LB are disabled by default

lambda as target type not supported for NLB

if instance id is the target type, ec2 instances get the client ip directly

in case of IP address, we need to use proxy protocol

microservices works with IP address

Proxy protocol enables actual client headers to be sent ahead (Only for TCP /layer 4)

For https/http, use x-forwarded- for header

ELB is region specific

Non standard webserver health check should be done with TCP instead of http

Session affinity/stickiness is cookies based

ELB doesn't support 2 way authentication over HTTPS. Client side certificates are not checked.

TCP would allow via proxy protocol settings .

ALB supports Server Name Indication (SNI) certificates (multiple certs pointing to same IP). CLB doesn't support SNI.

Re-resolving of DNS is important for ALB to respond correctly. Caching caches IP

Access logs are disabled by default. Details of client, protocol and etc.

API calls are in Cloudtrail.

100 rules per ALB

One Target Group Can only be attached to one Load balancer.

Target of target group can be ec2 instance, ecs application, Private IP address or one lambda function.

Cannot register the IP of another ALB in same vpc as target.

On-prem instance's IP address can be used as well as same ip address with different ports (microservice).

DNS:

CLB/ALB/NLB, CDN, S3 are routed using DNS. IPs can change so use Alias record.

Cannot create CNAME for apex/naked domain name.

Routing policy: Simple, failover , geo-location, latency, weighted routing policy.

Weighted routing policy has weights defined from 1 -255

No health check would be treated as healthy target. Evaluate target health no would not care about health check.

About 8 recordset as part of multi-value answer r53 policy

R53 Resolvers: (regional service)

Inbound(on-prem to AWS) and outbound (aws vpc to on-prem)

Internet resolver is created by default.

CloudFront:

Global service - not regional

PCI DSS and HIPAA compliant

For dynamic content loading, ttl =0 and use query strings

Streaming distribution is RTMP CDN, progressive/download is web distribution

Rtmp is for adobe media server streaming

CDN cache get and head type request

Signed URLs , cookies and object access identifier

URL should have valid end date and time for validity

Cloudfront Access logs needs to be enabled

Cannot use signed urls or cookies if existing url uses expires, policy, signature, key-pair-id

TTL Cache keeps sending GET/Header call to Origin with a flag isModifiedSince

Default TTL is 1 day 86400

S3 bucket only allow http connection. No https

Server Name Indication (SNI) for multiple certificates . SNI should be supported by browser.

Chunk encoding is supported by Apple HLS (HTTP live streaming), Adobe http dynamic streaming (HDS), Microsoft smooth streaming

Use elastic transcoder to convert video to HLS.

Signed cookies better choice than signed URLs for media streaming.

Signed URLs are more appropriate for static content.

RTMP should have only S3 bucket as origin and should also have web distribution for media player

Cloudfront viewer reports --> user location, devices, browsers

Compute:

AutoScalingGroup:

PCI DSS compliant

One EC2 instance can be part of only one ASG

Instance states -> pending, health check --> in service --> terminating --> terminated

In service to standby or detaching state

Regional component

Merge ASG only via CLI

Suspend ASG policies to troubleshoot EC2s

VMs concept: Traditional approach is Physical H/W --> OS --> Apps

VMs: Physical H/W --> hypervisor --> VM (OS + Apps) ..supports diff OS

Containers: Applications and binaries (libraries) are packaged together

Docker has application, libraries and runtime.

ECS -> Task Definition --> Docker Image

Kubernetes is container management system

Docker Enterprise Edition is also container management system used by ECS

Fargate launch type --> serverless container

AWS Glue is for ETL.

ECS is regional service

Task Definition can have upto 10 containers defined

Create Task Definition (max 10 containers) --> create service (no. of tasks required) --> run the service --> creates tasks which are running containers --> accessed by ENIs

Container agent runs on every container which is EC2 type and reports running tasks and resource utilization.

Clusters are region specific. Can contain both EC2 and Fargate launch types.

ECS Service is kind of ASG. Runs behind a ALB

Only one LB/TargetGroup per Service

AWS doesn't recommend one CLB infront of multiple services.

Mapping of container port to host port is dynamic port mapping in alb

If host port is 0 then it becomes dynamic host port

Lambda:

Trigger based. Passes events. S3, SNS, API triggers.

Configuration for lambda: Memory (128 -3 gb), Maximum execution time (3- 900 sec), IAM execution role.

Networking of lambda would need the vpc, subnets and security group

Invocation of Lambda:

Event sources: SNS, S3, Dynamo DB and etc

Https/ rest: API Gateway backend

Aws sdk: codes to call lambda

Events: scheduled or cron job with lambdas

Event source mapping is done for lambda triggers.

Event sources (S3, CDN) maintains the mapping for lambda

But streams based (Kinesis and DynamoDB) are maintained at lambda

1000 concurrent execution for lambdas per region

AWS layers is zip packaged with runtime, libraries and code.

For Async lambda, the function retries twice.

Dead Letter Queue is configured to handle the lambda async responses.

Stream based polling will stop the lambda processing if there's an issue.

SQS based polling would return the message in the queue if not processed and would be available in the queue after visibility timeout

Lambda@Edge needs to be created in us-east-1. Assign upto 4 lambda on CDN

Also used for http redirect, auth functions

API Throttling is ..how many get/put request being allowed per sec. after that http respoce 429 is returned.

API GW Caching is chargeable per gig of storage

API GW Proxy integration passes the client info to backend system

AWS SAM is based on cloud formation for serverless application

AWS Batch uses containers to run

Batch --> Job --> Job Definition --> Job Queues --> Priorities

Storage Service:

5GB file upload needs multipart. Upto 5TB

Once version is enabled on S3 , it cannot be disabled only suspended.

Delete marker if file is selected not version

Storage Classes:

S3 Standard --> highly durable. Suited for frequent access

RRS (Reduced Redundancy Storage) is not recommended and is on demise path

Infrequent access, IA_One Zone. Suited for infrequent access

Intelligent tiering. Expensive then IA

Glacier -> suited for archival. Retrieval time is upto 5 mins with special provisions

Glacier deep archive needs 12 hours

IA, IA_One_Zone and intelligent tiering have minimum storage for 30 days and size of 128kb

Glacier for 90 days and deep archive for 180 days

Glacier by defaults encrypts data being stored.

Data cannot be uploaded directly via console into Glacier

Not to use Glacier when we have real time data retrieval or frequently changing data

S3 Static hosting URL : bucketname.s3-website.region.amazonaws.com

For S3 Pre-sign URL: from CLI, aws s3 presign --expires-in (secs) s3 location of file

S3 not suited for dynamic data, or long archival (Glacier) data

S3 Object replication also copies metadata, ACL and object tags. Supports Cross Region as well as Same region replication

Delete marker not replicated

AES 256 Encryption is also called SSE S3 .

S3 allows 3500 requests per sec for PUT, COPY, POST and DELETE

5500 for GET and HEAD per prefix

Amazon s3 and glacier select to query s3 data

EBS Backed EC2 volumes. Instance store is ephemeral.

Instance store provides higher IOPS.

For snapshot of root volumes, stop and take snapshot. Cannot detach root volumes

Redundant Array of Independent Disks (RAID) volumes.

RAID 0 has the best IOPS performance. RAID1 has redundancy. RAID10 is mix of both

EFS follows NFS protocol. Shared by multiple EC2s in a VPC

Mount targets for EFS are ENIs. ENI to be created per AZ not per subnet

EFS Suited for BigData and Analytics (high throughput, readafterwrite consistency and low latency operations)

Media processing workflows (video and audio), Content management (web serving)

And home directories

EFS allows 2 storage class (Standard and Infrequent Access) both highly durable but retrieval charges higher in IA storage and cost of storage is less.

EFS lifecycle policies shift files from Standard to IA. File metadata always stored in Standard.

NFS Port 2049 should be allowed for EFS

EFS mount points are created on EC2 instances for sharing

AWS Backup Service used to back EFS data

AWS DataSync to help migrate data from on-prem to EFS or EFS to EFS

Open after close consistency and strong consistency

Performance modes: general purpose (low latency) and max i/o. cannot change once created

Burst and provisioned throughput modes

FSx Windows File Server supports windows and linux mode. Supports SMB(Server Message Block) windows and CIFS(Common Internet File System) protocol.

EFS supports only Linux whereas FSx supports Windows as well. Needs AD Integration for win

SMB port for TCP/UDP 445, RPC at 135

FSx also works with ENI

FSx Lustre is High Performance Computing, distributed, low latency,

Works with linux servers. Needs fsx client to be installed on the linux servers with mount for lustre

Not a repo to store long term data..use S3.

port 988 to be opened for FSx Lustre. No data replication or Multi AZ support for Lustre

SQL Database:

RDS allows Multi AZ synchronous replication and standby instance

Use provisioned iops for multi AZ setup

Loss of primary DB, patching, ec2 or ebs failure would lead to Failover

RDS Failover happens using DNS name

Automated backups are taken from StandBy Instances

RDS Read Replicas are created from primary instance for read operations and data is copied asynchronously.

Read Replica can also be Multi AZ.

Aurora read replicas are synchronously replicated. No standby.

Separate storage and compute for Aurora. Read replicas can be promoted to Primary

Aurora supports PostGreSQL and MySQL

Cluster endpoint is connected to Primary DB. Gets updated incase of failover.

Reader endpoint --> load balances to all reader replicas

Instance endpoint --> direct connection to instance

Custom endpoint --> logical grouping created by user

Aurora can have 15 read replicas. ASG should have at least 1 replica.

Aurora failovers to read replica

For Encrypting Aurora Cluster, we have to take a snapshot and create cluster from snapshot and choose Encryption.

Aurora Global DB: Primary in one region and secondary (read only) in another region. Secondary has 16 Aurora Replicas for readonly

Aurora MySQL can query data from S3 directly

5 Cross Region Read replicas MySQL

Cross region replication happens when clusters r publicly accessible

Aurora DB has Multi Master clusters. Master has read/write both

Aurora serverless works with Aurora Capacity Units..combo of vCPU and Storage

Aurora serverless a good option for reporting or unpredictable loads

Supports only 1 AZ

ElasticCache:

In memory data. Reduces read workload

Has EC2 clusters running in backend. Automatic failover happens

Memcached - plain cache..no DB

Redis - noSQL DB

Memcached is not persistent. Good for stateless app transient data

Doesn't support MultiAZ. No encryption support.

Redis: multiAZ, persistent, snapshotting of cache. Supports Pub/Sub.

Copying Redis snapshot to different region needs copying snapshot in s3 and moving it in that region

Have 2 nodes in Redis Cluster with Multi AZ setup for Automatic Failover.

Complex data operations

Dynamodb:

Supports 3 AZ replication of data

Max 400KB datatype storage

Partition storage back by SSD drives.

Global Secondary Index can be any 2 attributes

GSI has separate storage

Local Secondary Index needs to be with same Primary Key (type and name)

DynamoDB backups cannot be copied to other regions

DynamoDB restore happens to new table

DAX (DynamoDB Accelerator ) for caching micro seconds response

DAX deployed in VPC. Runs in clusters with 9 read replicas and 1 primary

DAX TTL is 5 mins

DynamoDB streams are stored for 24 hrs

TransactWriteItems and TransactGetItems commits or fails all

Sparse indexes doesn’t copy GSI data if value of sort key is empty

Analytics:

Redshift: OLAP DB for Data Warehousing purpose.

AWS Managed. DWH solution for structured data (RDBMS).

Supports replication. Doesn't support huge real time streams ingestion.

Columnar data storage. Min 160GB size

Snapshots stored in S3. Retention 0-35 days

Redshift supports single AZ

Athena:

Queries S3 using SQL kind statements. AWS Managed.

Integrated with QuickSight for visualization support.

For fast retrieval , store data in columnar fashion using EMR. (Apache Parquet & ORC)

Kinsesis:

Streams of data. Multiple sources sending KB /MB of data.

Kinesis is AWS managed. Terabytes of data per hr

Kinesis stream take the data real time and convert them into shards.

Data from stream can be sent to dynamodb, S3 or redshift

Kinesis stream retention by default is 24 hrs

Replicates synch across 3 Azs

Firehose:

Firehose takes streams of data (logs, IoT or Kinesis streams) as the input and delivers the streams to services such as S3, ElasticSearch, Splunk and Redshift

Firehose also transforms the data. Encrypts, compress and batch is also available.

For sending streams to Redshift, first send the streams of data to S3 and then transform

Kinesis Analytics:

Used for running analytics on streams data. Allows running queries to your streams data

EMR (Elastic Map reduce):

Uses hadoop framework with EC2 (clusters) and S3(store input and output).

EMR is used for web indexing, data mining, machine learning, bioinformatics etc

Apache Hive is for DWH, hBase is distributed DB, Spark is compute engine and Apache Hadoop is Software

Hive is used for Querying the DWH using Hive QL.

Pig is used for Analytics ..Pig Latin

EMR has Master node (distributing load. Has software and hbase DB)

Core node..used in Multi node setup. Has software and hbase DB..no distribution logic

Task node..optional component. Only has software..apache hadoop..no persistence. Spot instanc

EMR runs cluster in single AZ.

EMR is not for real time ingestion..use kinesis. It needs S3 to store the input data.

Kinesis connector allows ingestion of streams data into EMR using Hive QL or Pig script

AWS DataPipeLine:

Orchestration which allows on-prem and cloud operations. Moving data and tansforming.

Task Runner: Needs to be installed on computes (on-prem VMs or EC2 or EMR Clusters). Communicates with the Pipeline

Data Nodes: Specifies the input and output data nodes. SQLDN, RedShiftDN, DynamoDBDN

Database supported: RDS, DynamoDB and JDBC

Quicksight:

DataSources --> DataSets (transformed data) --> Analyses -->Visuals --> Dashboard

SPICE is the engine for Quicksight. Superfast Parallel InMemory Calculation Engine

Import the data into SPICE

GLUE:

Fully managed ETL service. Glue Crawlers, Job Authoring (Transform), Job Execution

Glue Data Catalog: Central repo for storing metadata

Glue Crawler connects to datasource and identifies the type of DS and populates Data catalog

Use Glue when needs to run ETL with S3 and query using Amazon Athena and Redshift Spectrum

Glue is based on Apache Spark..unlike Data Pipeline (EMR)

Kinesis Video Streams:

Intakes streams of data from video, images producers and sends it to EC2s

Usecase: need to stream videos to many devices real time

Put media supports only mkv format

AWS X-Ray:

Helps in tracking operations of distributed applications.

Has a trace ID. Segment

Not an audit or compliance tool

X-Ray Agent- installed on service to send traces. EC2, ECS, Lambda

X-Ray Group is group of traces to generate service graph

Security:

IAM Role: Assumes permissions using STS for the policies attached.

Resource based policy: allows to have cross account policies without IAM roles being compromised

Service Role: role specific to services. Have permissions defined

S3,Glacier, SNS and SQS allows Resource Based Policies

AWS STS: Secure Token Service to grant temp credentials to trusted users.

Assume role : using IAM roles

Assume role SAML: SAML 2.0 Identity Provider..corporate AD

Assume role with WebIdentity: OIDC web identities such as Google, FB

GetSessionToken: used for MFA

GetFederationToken: used by IAM users

For Web identity federation login, register as a developer with IDP. You'll get a developer id

OIDC ID from IDP as trust

SAML 2.0 IDP. For enterprise based AD login

AssumeRoleWithSAML call includes ARN of IDP XML, ARN of role requested and SAML assertion from IDP (ADFS)

AWS Signin URL: https://signin.aws.amazon.com/saml

Identity Broker calls GetFederationToken directly to STS

Active Directory Services:

AWS Microsoft AD: Suitable for a fresh AD setup.

AD Connector: connect to on-prem AD for AWS services. Existing
Simple AD. Good if we have users less than 5000. Samba 4 AD

AWS Microsoft AD supports Simple Edition (upto 5000 users/30000 objects) and Enterprise Edition (upto 500K objects).

This is only setup compatible with RDS Microsoft SQL DB.

For authentication via AD between AWS and on-prem, we need VPN setup. Direct connect is not encrypted but VPN is.

AD Connector is a proxy service for on-prem AD setup.

Key Management System:

AWS KMS is FIPS compliant. Cloudtrail monitors the usage

KMS is Global but keys are regional

CMK is used to create Data key which is used for encrypt and decrypt

Customer Managed CMK and AWS Managed CMK (S3 default)

CMK keys have key policy to decide user permissions

GenerateDataKeyWithoutPlaintext

AWS Server Side Encryption: SSE-S3, SSE-C & SSE-KMS

SSE-S3 is free

Header x-amz-server-side-encryption:aws-kms

Data key/volume key/object key are all same

Glacier automatically encrypts data at rest with AES 256.

Storage G/W uses SSL in transit

EMR uses SSE-S3 while copying data in S3

RDS optionally uses AWS CMK to encrypt

Cloudtrail:

CloudTrail trail can be created to send trail logs to S3 bucket

Cloudtrail logging for all regions is sent to one S3 bucket

Allows 5 trails per region

Stores 90 days of data

Cloudtrail to S3 is SSE-S3 encrypted

Cloudtrail can be configured to receive logs from different AWS account

For putting the logs in S3 from different account, use bucket policies

For reading the logs in S3 from different account, use IAM role

Limit access to S3 bucket storing cloudtrail logs. Provide readOnly and MFA

CloudTrail allows log integrity check using SHA 256 hashing and digital signing.

DDoS: Distributed Denial of Service Attacks

Attacks happen at layer 3,4,6 & 7 (network/transport and application)

Spoof attack: Spoofing of Target IP as src sent out to mediators/reflectors which then tries to respond back to Src creating multiple magnified responses.

TCP SYN Attack: SYN --> SYN-ACK --> ACK. In this case, ACK is not sent by the attacker and connection are reserved making them unusable for any other use.

HTTP Flood Attacks: Emulates human interaction and sends HTTP traffic

Cache busting attack: by using different query string, bypassing the CDN to fetch results from origin servers

DNS Query Flood attacks: flooding the DNS with different DNS names

Mitigation Techniques:

AWS Shield: Free Standard edition by AWS for network atacks

Use R53, Cloudfront, WAF

AWS Shield Advance: provides may features for DDoS attacks

ALB/ELB doesn't allow UDP traffic

CDN closes SYN Attacks..half connections

WAF allows Web ACL rules

Reduce Surface Attacks: Minimize Internet access and reach. Use CDN ..no EIP..No DB in public subnet

Use CDN, ELB, Private Subnets, API GW for obfuscating resources.

WAF works at HTTP HTTPS Application layer for CDN and ALB, API GW

WAF blocks Cross Site Scripting XSS attack, SQL injection attacks

WACL have rules to allow, block or monitor (count)

3rd Party WAF is also a good solution to obfuscate internal components

Intrusion Detection System & Intrusion Protection System

Host Based and Network Based Intrusion Detection System

Promiscuous port/ network tapping/sniffers not allowed in AWS

Introduce IDS/IPS 3rd Party sandwich setup (ELB --> EC2 with IDS --> ELB (APP)) for security

RAM: Resource Access Manager...sharing of AWS resources among AWS accounts

Sharing of Transit G/W, Subnets, R53 rules allowed to share.

RAM eliminates the need to create duplicate resources in multiple accounts, reducing the operational overhead of managing those resources in every single account you own.

SNS: Push based

By Default, only topic creator can publish message but can provide access to other AWS users to publish message

SNS Mobile Push Notifications for Mobile notifications. Pop up

SQS: Pull based,used for decoupling the application

Delete the messages in the queue without deleting the queue

SQS messages can be processed and then forwarded to other SQS

Standard Queue: no sequence, duplication of msg possible, at least one delivery, high throughput

FIFO queue: limited throughput 300 TPS, exactly one delivery, FIFO

Polling mechanism: Short and Long Polling (preferred)

Short returns the response immediately even if queue is empty. Receivemessagewaittime is = 0

Short is default

LongPolling is preferred option. ReceiveMessageWaitTime> 0 and <=20 secs

Default retention period is 4 days with max 14 days and min 1 min

SQS queues can have priorities ..higher ones are handled fast

Create as many queues you want

SQS Visbility timeout is the time for which the message is locked by an instance for processing and no other instance would be able to pick it up. Only after visibility time out is done, message is sent back to queue.

30 secs bydefault upto 12 hours

Once ACK is received the message should be deleted from queue

SQS is regional component .HA in Multi AZ setup

IAM policies on SQS can decide who can send and receive messages

SNS topic can publish message to SQS as subscriber

Amazon Mechanical Turk:

Online crowdsourcing platform where requester can post work and worker

can accept and work on the same.

Amazon Rekognition:

Rekognition reads images and videos. Input can be uploaded as binary to rekognition or uploaded to S3.

Output of rekognition can be S3, Kinesis streams & SNS.

If input is Live video stream so need to use Kinesis Video Streams and output would be Kinesis streams.

Has recognize celebrities API. Can detect labels like tree, flower, table etc. Events like wedding, party. Landscapes.

DetectLabels API (Images) StartLabelDetection (Videos)

DetectFaces(Images) StartFaceDetections(Videos) CreateStreamProcessor(Streams)

People path ..activity tracking..only in videos. StartPersonTracking API

DetectText API to read from Images.

Jpg or png allowed for S3 stored images. Base64 for direct image

AWS Simple Workflow System: SWF

Orchestration of an asynchronous workflow system

SWF has workers. Workers performs the tasks and reports back

Workers perform tasks. Tasks can be done by person or machine

SWF has deciders which decides which activity tasks to perform.

SWF does long polling.

3 types of tasks:

Activity tasks: used by workers

Lambda Tasks: executes lambda

Decisions Tasks: used by Deciders to decide next actions

One domain SWF Workflow is independent of others

SWF is Task Oriented. SQS is Message Oriented

Workflow execution can run upto 1 year

AWS Step Function: Better SWF with Visual workflow

Automatic trigger and retries

Amazon states language

Activity Tasks (polls for activity ) and Service Tasks (calls another service push)

API G/W can trigger Step function with a specific state

AWS Application Discovery Service: Helps in migration effort

Integrated with Migration HUB

Agentless Discovery Service: Sits on VMWare and collects data on cpu, ram and disk io

Agent Based Discovery: Installed on Physical servers. Windows and linux supported

Information pushed to S3. Integrated with Amazon athena and quicksight.

Data exploration must be Turned On.

AWS Storage G/W:

Connects on-prem using iSCSI interface. VMDK file is installed on on-prem servers

File G/W: NFS type. Transfers file to AWS S3

Volume G/W:

Storage G/W: iSCSI type. Uses AWS for storage perspective (backups)

Cached G/W: Frequently accessed data is on-prem and rest of the data on AWS. Block mode

VTL G/W: Virtual Tapes storage in AWS

File G/W connection would need gateway to be created. Hyper V, vmware or computes.

Create a file share and mount the same on the servers.

Snowball Family:

Devices to do large scale data migration

Good choice for data > 10 TB

If the link can transfer data in a week..use link..more than that use Snowball

Snowball--> 50 (Only US) - 80 TB device. Plain import/export jobs

Snowball Edge: 100TB device with compute

Snowmobile: Exabyte storage truck. 1Exabyte = 1000 petabyte = 1000000 TB

AWS Migration HUB:

Services to do the migration. DB and Server Migration Services. Allows central tracking of these services. Used with Discovery services as well.

3 ways Migration HUB gets the data from on-prem:

Migration HUB import
Agentless Discovery Agent (VM)
Agent Based Discovery Agent

AWS Server Migration Services:

Used for on-prem migration of server VMWare, HyperV or Azure virtual machines.

Server migration Connector gets on-prem data as AMIs. Uses Cloud Formation template to create the stack. So, AMIs --> CloudFormation . Template defines DB, App and other layers.

AWS Database Migration Services (DMS):

DMS is used for migration from on-prem to AWS and vice versa.

Also used to maintain the replica of DB.

Supports security at rest and in-transit

Creating indexes, primary keys and etc on target DB can be done with the help of SCT tool

Replication Task has the rules and actual tables defined.

SCT Tool is installed on-prem. Clones the source DB and uses the agent to copy data to S3. DMS copies from S3 to target.

DMS instance comes in 50GB or 100GB sizes.

DMS supports MultiAZ support.

Source & Target endpoints to have the connection of DB. You can test the connections

No CDC enabled for MS Azure SQL migration. IBM DB2 as source and not target.

MongoDB is document DB NoSQL. Document is row and collection of docs is collections ..table

MongoDB as source and not as target. Document and Table mode supported

Document DB, Redshift, Elasticsearch , kinesis cannot be source DB

MySQL, Oracle can be source as well as target for DMS.

AWS Amplify is a development platform for building secure, scalable mobile and web applications. It makes it easy for you to authenticate users, securely store data and user metadata, authorize selective access to data, integrate machine learning, analyze application metrics, and execute server-side code. Amplify covers the complete mobile application development workflow from version control, code testing, to production deployment, and it easily scales with your business from thousands of users to tens of millions.

Key Notes:

--> Custom Application should have listeners configured for TCP.

--> Additional ENIs have a mixed MAC address that does not change. ETH0 is the Primary ENI which cannot be detached.

--> you cannot point an A record to Load Balancer. Either Alias or CNAME.

--> A low RPO solution is asynchronous replication

--> for encrypting an EBS running volume, create a snapshot and launch a volume from that

--> On-prem and AWS CIDR Overlaps won't allow communication

--> Enhanced networking instance and EC2 IOPS types for low latency and hight n/w throughput

--> RAID 0 increases Read and write capacity

--> SSE-KMS is envelope encryption

--> SSH/RDP to limit access to ec2 NOT IAM roles

--> Advertise AWS Public Ips over Public VIF not Private VIF

--> During RDS Failover, DNS Record changes but not the RDS endpoint

--> Cross Region Read Replica exists for RDS and Aurora

--> EBS volume replication in the same region is also via snapshot

--> NAT G/W cannot be assigned security group. Can be assigned Elastic IP but not Public IP

--> 2 tunnels configured for each VPN Connection

--> you need one private VIF(VPC) and one public VIF (AWS services) for Direct connect.

--> DX connection with a VPN backup

--> IPV6 is globally unique..same as Public IP. Egress is related to IPV6.

--> instance store are virtual hard drive..ephemeral storage

--> you can increase the size of volume but cant scale down

--> bastion host allows SSH or RDP access. Can be assigned Public IP or Elastic IP. Sits in Public subnet and allows access to private subnet.

--> EC2 can have secondary ENI which can be detached and assigned to another EC2 incase of failover.

--> Copying the EBS volume from one AZ to another, create a snapshot and launch the EBS from it. EBS volumes are AZ specific.

--> For sharing encrypted snapshot with others , one would have to share the CMK key permissions as well

--> ELB routes the traffic to Eth0 primary IP of ur EC2. ELB needs 8 IP address to scale.

--> NLB doesn't support Lambda target type

--> you cannot read/write to StandBy RDS instance

--> Snapshots and Auto Backups done on StandBy Instance. Only Manual snapshots can be shared..not automatic ones

--> The first snapshot is full and after that it's incremental backups. DB Transaction logs allows upto 5 mins of RPO

--> you cannot restore into an existing DB..it has to be a new DB instance.

--> Automatic backups must be enabled for Read Replica to work

--> RDS Read Replicas can be in a different region as well. Based on automatic snapshots being copied asynchronously.

--> you cannot encrypt the DB instance on the go. Create a snapshot--> copy the snapshot and encrypt it and create a DB instance out of it.

--> you can't disable encryption as well.

--> IAM DB authentication works with My SQL and PostGre SQL.

--> AWS doesn't support IP Multicast

--> Redshift can load data only from S3 Standard.

--> Aurora spans Multi AZ in one region..maintains at least 6 copies

--> Failover priority for Aurora replicas. 15 read replicas supported.

--> Aurora Clusters can have single instance or multiple instance behind it. Aurora clusters are created inside the VPC.

--> Aurora Global cluster has a primary cluster in one region and readonly cluster in another region. 16 replicas in other region.

--> Backtracking takes the data backup to a point in time. Doesn't need a new DB instance to recover.

--> Cloudtrail event history is logged for 90 days. Cloud trail logging integrity check via SHA hashing on S3

--> Log retention of cloudwatch 1 day to 10 years. Auto delete. CW Logs encrypted at Rest and In-transit.

--> Unified cloudwatch agent is preferred over old one.

--> Create custom metric from CW logs --> Create Alarm using that metric filter --> Attach an EC2 instance/ASG or SNS topic with that

--> Logs could be sent to cross account using kinesis streams

--> S3 Max size is 5TB. Multipart upload for more than 5GB. Preferred for > 100MB

--> Standard IA for backups. 30 day minimum charge for storage

--> Glacier for Archives with retrieval time of minutes. Min storage is 90 days. Can sustain data loss in 2 facilities. Doesn't maintain metadata. Use a DB for that.

--> Deep Archive ..not for real time retrieval. 180 days of min storage.

--> RRS. Frequently accessed non critical data. Not recommended.

--> S3 uses SSE and KMS encryption. S3 Static hosting allows redirection but only http

--> Pre-Signed URLs to provide specific object access to users for limited time.

--> For S3 Cross Region or same region replication, the bucket versioning must be enabled.

--> S3 replicated the objects, metadata and the tags as well. Replicates the delete marker but doesn't delete.

--> S3 transfer acceleration used for uploading objects over internet.

--> S3 provides 3500 PUT/COPY/POST/DELETE per seconds and 5500 GET/HEAD. Prefixes created multiply the capacity.

--> Elemental MediaStore is caching and distribution for video workflows. Delivers videos from S3.

--> S3 Select to query data in S3 if stored in CSV, JSON format. Glacier Select for Glacier.

--> Permission to enable server accesss log only via bucket ACL not by policy. ACLs are handy for object level permissions.

--> Requester pay buckets for charging requester to pay for access or downloading

-->EFS is POSIX (UNIX) compliant and needs to be mounted on EC2 or On-prem servers. Suited for BigData & Analytics. Low latency file operations.

--> EFS can be mounted on linux on-prem only

--> Amazon FSx is windows file share. Windows shadow copies are point in time snapshot stored in S3.

--> Lustre is Open source high performance computing. Linux based DFS. Not durable

--> cname cannot be created for naked zone apex.

--> Traffic flow policy only for public hosted zone

--> S3 only allows HTTP protocol for Origin access from Cloudfront.

--> Invalidate Only Web objects from CDN

--> EMR is not for real time ingestion or large data

--> Redshift supports Single AZ

--> EMR leverages Apache Hadoop + Hive . SQL based query and support for unstructured data.

--> EMR kinesis connector to read streaming data from kinesis to EMR for processing

--> Glue Crawlers identifies metadata and populates the glue catalog used by Glue to transform the data

--> SQS Queue can be encryption enabled.

--> AWS Serverless Application Module is an extension of Cloudformation.

--> AWS Batch uses ECS Container to run jobs

--> You cannot have mount points for the same EFS file systems in different VPC. Only 1 at a time.

--> Amazon FSx has Standby file server in a different AZ synchronously replicated.

--> NAT GW uses only Elastic IPs. No Public IP

--> Session stickiness/affinity can be application controlled or LB controlled. Cookie header inserted.

--> ALB supports only load balancer cookies. Cookie name AWSALB

--> DAX for Dynamo DB is Only for eventual consistent and deployed on EC2 instance with a DAX agent

--> SCP has no effect on the Primary/Master account though it is applied at root level. Whitelisting or Blacklisting policies at SCP.

--> AWS Organization creates Service Linked Role in each of member accounts to access.

--> IAM roles Trust Policies have principals which are Trusted Entities. Such as federated: saml-provider/ADFS or service: ec2 or lambda

--> RDS Read Replicas are created using the snapshot from primary DB or Standby (MultiAZ). Asynchronously update

--> Aurora Serverless clusters are always encrypted

--> Cloudformation has Creation policy which allows waitCondition to delay resource creation. Deletion policy to delete, retain or snapshot resource.

--> blue green deployment is an Active StandBy configuration

--> Cloudwatch Synthetics are used for running Canaries Script to automate user/customer actions for the service.

--> Cloudwatch ServiceLens is used for endtoend debugging with Cloudwatch + Xray

--> AWS Service Catalog allows organizations to create and manage catalogs of IT services that are approved for use on AWS. Also, maintain the catalog

portfolio stack which are cloud formation stack ready to build common stacks.

--> AWS Resource DataSync allows to collect data configuration from multiple resources.

--> AWS DNS Doesn't support DNSSEC (Domain Name System Security Extensions). Use 3rd Party DNS Provider for this support.

--> AWS Systems Manager Patch Manager automates the process of patching managed instances with security-related updates.

Patch Manager uses patch baselines, which include rules for auto-approving patches. A patch group is an optional means of organizing instances for patching.

For example, you can create patch groups for different operating systems (Linux or Windows), different environments (Development, Test, and Production),

or different server functions (web servers, file servers, databases).

--> If you want a single store for configuration and secrets, you can use Parameter Store. If you want a dedicated secrets store with lifecycle management, use Secrets Manager.

--> To use a certificate with Elastic Load Balancing for the same site (the same fully qualified domain name, or FQDN, or set of FQDNs) in a different Region,

you must request a new certificate for each Region in which you plan to use it. To use an ACM certificate with Amazon CloudFront,

you must request the certificate in the US East (N. Virginia) region.

--> DynamoDB Global Table. A replica table (or replica, for short) is a single DynamoDB table that functions as a part of a global table. Each replica stores the same set of

data items. Any given global table can only have one replica table per AWS Region.

--> AWS CloudHSM is a cloud-based hardware security module (HSM) that enables you to easily generate and use your own encryption keys on the AWS Cloud.

Handles SSL offloading as well and for HA, needs 2 subnets

--> can't deploy an application to your on-premises servers using Elastic Beanstalk

--> AWS Resource Access Manager (AWS RAM) enables you to share specified AWS resources that you own with other AWS accounts. To enable trusted access

with AWS Organizations: From the AWS RAM CLI, use the enable-sharing-with-aws-organizations command.

Name of the IAM service-linked role that can be created in accounts when trusted access is enabled: AWSResourceAccessManagerServiceRolePolicy

--> Rehost (“lift and shift”) - In a large legacy migration scenario where the organization is looking to quickly implement its migration and scale to meet a business case,

we find that the majority of applications are rehosted. Most rehosting can be automated with tools such as AWS SMS although you may prefer to do this manually as

you learn how to apply your legacy systems to the cloud.

You may also find that applications are easier to re-architect once they are already running in the cloud. This happens partly because your organization will have developed better skills to do so and partly because the hard part - migrating the application, data, and traffic - has already been accomplished.

--> Replatform (“lift, tinker and shift”) -This entails making a few cloud optimizations in order to achieve some tangible benefit without changing the core architecture of the application. For example, you may be looking to reduce the amount of time you spend managing database instances by migrating to a managed relational database service such as Amazon Relational Database Service (RDS), or migrating your application to a fully managed platform like AWS Elastic Beanstalk.

--> By default, the data in a Redis node on ElastiCache resides only in memory and is not persistent. If a node is rebooted, or if the underlying physical

server experiences a hardware failure, the data in the cache is lost.

If you require data durability, you can enable the Redis append-only file feature (AOF). When this feature is enabled, the node writes all of the commands that change cache data to an append-only file. When a node is rebooted and the cache engine starts, the AOF is "replayed"; the result is a warm Redis cache with all of the data intact.

-->Turn off the Reserved Instance (RI) sharing on the master account for all of the member accounts.

--> You can use AWS SAM with a suite of AWS tools for building serverless applications. To build a deployment pipeline for your serverless applications, you can use CodeBuild, CodeDeploy, and CodePipeline. You can also use AWS CodeStar to get started with a project structure, code repository, and a CI/CD pipeline that's automatically configured for you. To deploy your serverless application, you can use the Jenkins plugin, and you can use Stackery.io's toolkit to build production-ready applications.

--> You can improve performance by increasing the proportion of your viewer requests that are served from CloudFront edge caches instead of going to your origin servers for content; that is, by improving the cache hit ratio for your distribution. To increase your cache hit ratio, you can configure your origin to add a Cache-Control max-age directive to your objects, and specify the longest practical value for max-age. The shorter the cache duration, the more frequently CloudFront forwards another request to your origin to determine whether the object has changed and, if so, to get the latest version.

--> you can set up an origin failover by creating an origin group with two origins with one as the primary origin and the other as the second origin which CloudFront automatically switches to when the primary origin fails.

--> Modifying the enableDnsHostNames attribute of your VPC to false and the enableDnsSupport attribute to true is incorrect because with this configuration, your EC2 instances launched in the VPC will not get public DNS hostnames.

--> SCPs DO NOT affect any service-linked role. Service-linked roles enable other AWS services to integrate with AWS Organizations and can't be restricted by SCPs.

--> You can bring part or all of your public IPv4 address range from your on-premises network to your AWS account. You continue to own the address range, but AWS advertises it on the Internet. After you bring the address range to AWS, it appears in your account as an address pool. You can create an Elastic IP address from your address pool and use it with your AWS resources, such as EC2 instances, NAT gateways, and Network Load Balancers. This is also called "Bring Your Own IP Addresses (BYOIP)".

--> Amazon Connect provides a seamless omnichannel experience through a single unified contact center for voice and chat. Contact center agents and managers don’t have to learn multiple tools, because Amazon Connect has the same contact routing, queuing, analytics, and management tools in a single UI across voice, web chat, and mobile chat.

--> Amazon Lex is a service for building conversational interfaces into any application using voice and text. Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable you to build applications with highly engaging user experiences and lifelike conversational interactions.

--> Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries. Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues.

--> By setting up cross-account access in this way, you don't need to create individual IAM users in each account.

--> installing the SSM Agent to all of your instances is also required when using the AWS Systems Manager Patch Manager.

--> the best solution is to use a combination of CloudFront, Elastic Load Balancer and SQS to provide a highly scalable architecture. SQS decouples the service. Reduces cost.

--> Inter-Region VPC Peering provides a simple and cost-effective way to share resources between regions or replicate data for geographic redundancy.

--> Adding tags to the EC2 instances in the production environment and adding resource-level permissions to the developers with an explicit deny on terminating the instance

which contains the tag

--> SSE-S3 provides strong multi-factor encryption in which each object is encrypted with a unique key. It also encrypts the key itself with a master key that it rotates regularly accurately describes how SSE-S3 encryption works.

--> SCP does not grant any permissions, unlike an IAM Policy. SCP policy simply specifies the services and actions that users and roles can use in the accounts.

--> If you have a VPC peered with multiple VPCs that have overlapping or matching CIDR blocks, ensure that your route tables are configured to avoid sending response traffic

from your VPC to the incorrect VPC. Add a static route on VPC A's route table with a destination of 10.0.0.0/24 and a target of pcx-aaaabbbb. The route for 10.0.0.0/24 traffic

is more specific than 10.0.0.0/16, therefore, traffic destined for the 10.0.0.0/24 IP address range goes via VPC peering connection pcx-aaaabbbb instead of pcx-aaaacccc.

--> In AWS Storage Gateway, your iSCSI initiators connect to your volumes as iSCSI targets. Storage Gateway uses Challenge-Handshake Authentication Protocol (CHAP) to

authenticate iSCSI and initiator connections. CHAP provides protection against playback attacks by requiring authentication to access storage volume targets

--> AWS Systems Manager Patch Manager automates the process of patching managed instances with security-related updates.

--> won't be able to upload their articles to the Read Replicas in the event that the primary database goes down.

--> Cloudfront signed cookies feature is primarily used if you want to provide access to multiple restricted files, for example, all of the files for a video in HLS format or all

of the files in the subscribers' area of website.

--> If you have multiple VPN connections, you can provide secure communication between sites using the AWS VPN CloudHub. This enables your remote sites to communicate with each other, and not just with the VPC.

--> The deployment services offer two methods to help you update your application stack, namely in-place and disposable. An in-place upgrade involves performing application updates on live Amazon EC2 instances. A disposable upgrade, on the other hand, involves rolling out a new set of EC2 instances by terminating older instances.

--> Hybrid Deployment model: combines the simplicity of managing AWS infrastructure provided by Elastic Beanstalk and the automation of custom network segmentation

provided by AWS CloudFormation.

--> AWS Control Tower is best suited if you want an automated deployment of a multi-account environment with AWS best practices. If you want to define your own custom

multi-account environment with advanced governance and management capabilities, we would recommend AWS Organizations

--> AWS organization entities are globally accessible, similar to how AWS Identity and Access Management (IAM) works today.

--> You cannot change which AWS account is the master account.

--> ELB cannot have EIP or Static IP attached.

--> OpsCenter is a Systems Manager capability that provides a central location where operations engineers, IT professionals, and others can view, investigate, and

resolve operational issues related to their environment.

--> EMR doesn't provide Detailed Monitoring. ELB, R53, RDS does

--> IPSec Tunnel provides data encryption across internet, protection of data in transit, peer identity auth vpn GW and Customer GW and data integrity protection.

--> ARN syntax arn:aws:service:region:account:resource. For IAM, region is left blank.

--> Default Security Group allows No Inbound Traffic. Allows All Oubound Traffic. Allows instances to communicate which have same SG.

--> ApplyImmediately to apply changes on RDS intances.

--> Amazon cognito maintains the last written version of the data using synchronize() method. Cognito uses SNS Push () to send notifications to devices.

-->AWS IAM passwords can contain the Basic Latin characters. Policy names cannot have /,\, * or ?. These are reserved. Path names should start and end with /. Max length 64 chars

--> AWS Doesn't Auto assign Public IP to Instance with Multiple ENIs.

--> you can create a VPC with multiple subnets and assign users to have access to only their specific subnet.

--> ELB supports only 1 Subnet from 1 AZ

--> IAM also has NotPrincipal in Policy.

--> Minimum storage for Provisioned IOPS MySQL RDS is 100 GB and min IOPS is 1000.

--> DataPipeline retries are allowed upto 10 retries.

--> SNS doesn't push notifications to Microsoft Windows Mobile Messaging. Microsoft Push Notification is supported.

--> as-describe-launch-configs-show-long shows launch config name, instance type, ami id.

--> ElasticCache stores critical piece of data in memory for low latency access.

--> Exception to a list of actions in IAM policy is provided by NotAction. MultiFactorAuthAge to check in seconds last MFA action.

--> Amazon ElasticCache's cache security group are applicable to cache clusters running outside of VPC.

--> PIOPS EBS Volumes supports 4GiB to 16 TiB and provision upto 20000 IOPS. But the ratio should be max 30. Now, it is 50:1

--> AWS Elastic Beanstalk provides several options for how deployments are processed, including deployment policies

(All at once, Rolling, Rolling with additional batch, and Immutable)

--> API GW Throttling error code is 429 while lambda or integration connection gives 504

--> In CloudFront, there are 3 options that you can choose as the value for your Origin Protocol Policy: HTTP Only, HTTPS Only and Match Viewer.

--> Amazon Inspector enables you to analyze the behavior of your AWS resources and helps you to identify potential security issues. Using Amazon Inspector, you can

define a collection of AWS resources that you want to include in an assessment target. You can then create an assessment template and launch a security assessment

run of this target. Just assessment..no capability to change or update.

--> In Redshift, if your query operation hangs or stops responding: Connection to the Database Is Dropped--> Reduce the size of maximum transmission unit (MTU)

Connection to the Database Times Out--> hang or timeout when running long queries, such as a COPY command

Client-Side Out-of-Memory Error--> ODBC or JDBC Out of Memory. There Is a Potential Deadlock --> Check STV_LOCKS and STL_TR_CONFLICT.

Use the PG_CANCEL_BACKEND and PG_TERMINATE_BACKEND.

--> Amazon MQ is a managed message broker service for Apache ActiveMQ that makes it easy to set up and operate message brokers in the cloud. Connecting your current

applications to Amazon MQ is easy because it uses industry-standard APIs and protocols for messaging, including JMS, NMS, AMQP, STOMP, MQTT, and WebSocket

--> ALB Only supports http/https protocol..no tcp/tls

--> Amazon AppStream 2.0 is a fully managed application streaming service. Suited for standalone desktop applications.

--> There are a lot of available AWS Managed Policies that you can directly attach to your IAM Users, such as Administrator, Billing, Database Administrator, Data Scientist,

Developer Power User, Network Administrator, Security Auditor, System Administrator and many others

--> The following scenarios highlight patterns that may not be well suited for blue/green deployments:

Are your schema changes too complex to decouple from the code changes? Is sharing of data stores not feasible?

Does your application need to be "deployment aware"?

Does your commercial off-the-shelf (COTS) application come with a predefined update/upgrade process that isn’t blue/green deployment friendly?

--> Oracle RAC is not supported by RDS. RMAN

--> Create a new CloudFront web distribution and configure it to serve HTTPS requests using dedicated IP addresses in order to associate your alternate domain names with

a dedicated IP address in each CloudFront edge location.

--> Code deploy provides Canary deployment configuration, the traffic is shifted in two increments.

--> Tape Gateway which will back up your data in Amazon S3 and archive in Amazon Glacier using your existing tape-based processes.

--> A service control policy (SCP) is a policy that specifies the services and actions that users and roles can use in the specified AWS accounts.

SCPs are similar to IAM permission policies except that they don't grant any permissions. Instead, SCPs specify the maximum permissions for an organization,

organizational unit (OU), or account.

--> you can configure more than one load balancer with an autoscaling group.

--> If an Auto Scaling group is launching more than one instance, the cool down period for each instance starts after that instance is launched. The group remains locked until

the last instance that was launched has completed its cool down period. In this case the cool down period for the first instance starts after 3 minutes and finishes at the 10th

minute (3+7 cool down), while for the second instance it starts at the 4th minute and finishes at the 11th minute (4+7 cool down). Thus, the Auto Scaling group will receive

another request only after 11 minutes

--> Tags are assigned automatically to the instances created by an Auto Scaling group.

--> If you have a running instance using an Amazon EBS boot partition, you can also call the Stop Instances API to release the compute resources but preserve the data on the boot

partition

--> When a user launches an Amazon EBS-backed dedicated instance, the EBS volume does not run on single-tenant hardware.

--> If there is more than one rule for a specific port, we apply the most permissive rule. For example, if you have a rule that allows access to TCP port 22 (SSH) from IP address

203.0.113.1 and another rule that allows access to TCP port 22 from everyone, everyone has access to TCP port 22.

--> The tag key cannot have a prefix as "aws:", although it can have only "aws".

--> The Amazon EC2 console provides a "Launch more like this" wizard which copies Instance Type, AMI, user-data, tags, placement group

--> You are charged for the stack resources for the time they were operating (even if you deleted the stack right away)

--> CloudFormation: Actual resource names are a combination of the stack and logical resource name.

--> To connect to Amazon Virtual Private Cloud (Amazon VPC) by using AWS Direct Connect, you must first do the following:

Provide a private Autonomous System Number (ASN) to identify your network on the Internet. Amazon then allocates a private IP address in the 169.x.x.x range to you.

Create a virtual private gateway and attach it to your VPC

--> IKE Security Association is established first between the virtual private gateway and customer gateway using the Pre-Shared Key as the authenticator.

--> To establish redundant VPN connections and customer gateways on your network, you would need to set up a second VPN connection. However, you must ensure that the

customer gateway IP address for the second VPN connection is publicly accessible.

--> DynamoDB Local Secondary Indexes can only be created while Table creation. DynamoDB uses JSON only as a transport protocol, not as a storage format.

--> you can copy data from an Amazon DynamoDB table into Amazon Redshift.

--> To construct the mount target's DNS name, use the following generic form: availability-zone.file-system-id.efs.aws-region.amazonaws.com

--> Component is code--> Workload is set of components --> technical capability is set of workloads.

--> using the AWS Server Migration Service (SMS) and installing the Server Migration Connector to your on-premises virtualization environment.

--> Amazon WorkDocs is a fully managed, secure content creation, storage, and collaboration service. With Amazon WorkDocs, you can easily create, edit, and share content, and because it’s stored centrally on AWS, access it from anywhere on any device. Amazon WorkDocs makes it easy to collaborate with others, and lets you easily share content, provide rich feedback, and collaboratively edit documents.

--> EC2Rescue can help you diagnose and troubleshoot problems on Amazon EC2 Linux and Windows Server instances. You can run the tool manually or you can run the tool automatically by using Systems Manager Automation and the AWSSupport-ExecuteEC2Rescue document. The AWSSupport-ExecuteEC2Rescue document is designed to perform a combination of Systems Manager actions, AWS CloudFormation actions, and Lambda functions that automate the steps normally required to use EC2Rescue.

--> AWS Step Functions provides serverless orchestration for modern applications. Orchestration centrally manages a workflow by breaking it into multiple steps, adding flow logic, and tracking the inputs and outputs between the steps.

--> Data Pipeline is for batch jobs.

--> Implementing database caching with CloudFront is incorrect because you cannot use CloudFront for database caching. CloudFront is primarily used to securely deliver data, videos, applications, and APIs to customers globally with low latency and high transfer speeds.

--> Snowball is suitable for the following use cases:

Import data into Amazon S3, Export from Amazon S3,

On the other hand, Snowball Edge is suitable for the below:

Import data into Amazon S3, Export from Amazon S3, Durable local storage, Local compute with AWS Lambda, Local compute instances, Use in a cluster of devices

Use with AWS Greengrass (IoT), Transfer files through NFS with a GUI

--> If you got your certificate from a third-party CA, import the certificate into ACM or upload it to the IAM certificate store. Hence, AWS Certificate Manager and IAM certificate store are the correct answers.

--> You can use an AWS Direct Connect gateway to connect your AWS Direct Connect connection over a private virtual interface to one or more VPCs in your account that are located in the same or different regions. You associate a Direct Connect gateway with the virtual private gateway for the VPC, and then create a private virtual interface for your AWS Direct Connect connection to the Direct Connect gateway.

--> All at once – Deploy the new version to all instances simultaneously. All instances in your environment are out of service for a short time while the deployment occurs. If the deployment fails, a system downtime will occur.

Rolling – Deploy the new version in batches. Each batch is taken out of service during the deployment phase, reducing your environment's capacity by the number of instances in a batch. If the deployment fails, a single batch will be out of service.

Rolling with additional batch – Deploy the new version in batches, but first launch a new batch of instances to ensure full capacity during the deployment process. This is quite similar with Rolling option. If the first batch fails, the impact would be minimal.

Immutable – Deploy the new version to a fresh group of instances by performing an immutable update. If the deployment fails, the impact is minimal.

Blue/green deployment – Deploy the new version to a separate environment, and then swap CNAMEs of the two environments to redirect traffic to the new version instantly. If the deployment fails, the impact is minimal.

--> Memacached allows MultiThreaded execution unlike Redis.

--> You can change the placement group for an instance in any of the following ways:

Move an existing instance to a placement group. Move an instance from one placement group to another. Remove an instance from a placement group.

Before you move or remove the instance, the instance must be in the stopped state. You can move or remove an instance using the AWS CLI or an AWS SDK.

--> Cognito User Pool handles the user AuthN to provide temp credentials to access EC2, ECS, API. Cognito Identity pool provides AuthZ to allows other aws service access (roles)

--> EFS storage is 47.9 TB, S3 5TB and EBS size of EBS.

--> Only NLB provides Static and Elastic IP. SNI provided by ALB and NLB.

--> gp2 16000 IOPS, PIOPS 64000 IOPS

--> EC2 health check watches for instance availability from hypervisor and networking point of view. For example, in case of a hardware problem, the check will fail. Also, if an instance was misconfigured and doesn't respond to network requests, it will be marked as faulty.

ELB health check verifies that a specified TCP port on an instance is accepting connections OR a specified web page returns 2xx code. Thus ELB health checks are a little bit smarter and verify that actual app works instead of verifying that just an instance works.

--> Redis single AZ setup has Append Only File . MultiAZ setup has redis read replica. You can have only 1 active at a time

--> Redis cluster mode enabled(multiple shards and multi AZ setup) and disabled (single shard and multi AZ setup).

--> If data needs to be transferred from multiple location then use Transfer Acceleration (Cloudfront edge) or use Snowball or edge. If repeated file transfers, use Direct connect

--> S3 Transfer Acce supports downloads as well

--> SSM State Manager maintains the state of EC2 and Hybrid infra.

--> CloudFormation has templates, stack and change sets

--> OpsWork can auto heal ur stack.

--> CloudFormation and Beanstalk can Only create infra in AWS not on-prem. AWS OpsWork and CodeDeploy can create infra on-prem and in AWS.

--> Elastic N/W Adapter better for enhanced n/wing. Multiple ENIs shouldn't be used.

--> The maximum size of the data payload of a Kineis Data Stream record before base64-encoding is up to 1 MB.

--> AMIs are a regional resource. Therefore, sharing an AMI makes it available in that region. To make an AMI available in a different Region, copy the AMI to the Region and then share it.

--> AWS Organizations comes with All Features(includes SCP and etc) or Consolidated Billing mode (only billing share)

--> The native tools allow you to migrate your data with minimal downtime. For eg. Mysqldump for MySQL migration

--> Tape gateway in AWS Storage Gateway service is primarily used as an archive solution. Cannot access the files on Tape Gateway directly..use File GW.

--> You share resources in one account with users in a different account. By setting up cross-account access in this way, you don't need to create individual IAM users in each account. U create role in Prod account , assign dev account role as trustee.

--> SCP Policies: Users and roles must still be granted permissions with appropriate IAM permission policies. A user without any IAM permission policies has no access, even if the applicable SCPs allow all services and all actions.

--> AWS tags are case sensitive. Pro-Active taggings are done using AWS Cloudformation , AWS Service Catalog. AWS IAM can allow/disallow service creation if tags are not there.

--> You can use AWS Config with CloudWatch Events to trigger automated responses to missing or incorrect tags.

--> Direct Connect provides consistent performance and latency for hybrid workloads and predictable performance.

--> If your exising backup software does not natively support cloud storage for backup or archive, you can use a storage gateway device, such as a bridge, between the backup software and Amazon S3 or Amazon Glacier.

--> AWS DataSync is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3, Amazon Elastic File System (Amazon EFS), or Amazon FSx for Windows File Server.

--> When migrating from one database source or version to a new platform or software version, AWS Database Migration Service keeps the source database fully operational during the migration, minimizing downtime to applications that rely on the database.

-->EC2 Failover to a replacement instance or (running) spare instance by remapping your elastic IP address to the new instance. An Elastic IP address is a static, public, IPv4 address allocated to your AWS account.

--> EBS volumes can be attached to a running EC2 instance and can persist independently from the instance.

--> Because snapshots represent the on-disk state of the application, care must be taken to flush in-memory data to disk before initiating a snapshot.

--> When you create a member account using the AWS Organizations console, AWS Organizations automatically creates an IAM role named OrganizationAccountAccessRole in the account. This role has full administrative permissions in the member account.

--> You can use trusted access to enable an AWS service that you specify, called the trusted service, to perform tasks in your organization and its accounts on your behalf.When you enable access, the trusted service can create an IAM role called a service-linked role in every account in your organization whenever that role is needed.

https://d1.awsstatic.com/whitepapers/Storage/Backup_and_Recovery_Approaches_Using_AWS.pdf?did=wp_card&trk=wp_card