Serverless ETL and Analytics with AWS Glue


Book Description

Build efficient data lakes that can scale to virtually unlimited size using AWS Glue Key Features Book DescriptionOrganizations these days have gravitated toward services such as AWS Glue that undertake undifferentiated heavy lifting and provide serverless Spark, enabling you to create and manage data lakes in a serverless fashion. This guide shows you how AWS Glue can be used to solve real-world problems along with helping you learn about data processing, data integration, and building data lakes. Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis such as ad hoc queries, data visualization, and real-time analysis using this service. It also provides a walk-through of CI/CD for AWS Glue and how to shift left on quality using automated regression tests. You’ll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, as well as getting to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you’ll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you to understand various performance tuning, troubleshooting, and monitoring options. By the end of this AWS book, you’ll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.What you will learn Apply various AWS Glue features to manage and create data lakes Use Glue DataBrew and Glue Studio for data preparation Optimize data layout in cloud storage to accelerate analytics workloads Manage metadata including database, table, and schema definitions Secure your data during access control, encryption, auditing, and networking Monitor AWS Glue jobs to detect delays and loss of data Integrate Spark ML and SageMaker with AWS Glue to create machine learning models Who this book is for ETL developers, data engineers, and data analysts




Serverless ETL and Analytics with AWS Glue


Book Description

Build efficient data lakes that can scale to virtually unlimited size using AWS Glue Key Features Book DescriptionOrganizations these days have gravitated toward services such as AWS Glue that undertake undifferentiated heavy lifting and provide serverless Spark, enabling you to create and manage data lakes in a serverless fashion. This guide shows you how AWS Glue can be used to solve real-world problems along with helping you learn about data processing, data integration, and building data lakes. Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis such as ad hoc queries, data visualization, and real-time analysis using this service. It also provides a walk-through of CI/CD for AWS Glue and how to shift left on quality using automated regression tests. You’ll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, as well as getting to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you’ll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you to understand various performance tuning, troubleshooting, and monitoring options. By the end of this AWS book, you’ll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.What you will learn Apply various AWS Glue features to manage and create data lakes Use Glue DataBrew and Glue Studio for data preparation Optimize data layout in cloud storage to accelerate analytics workloads Manage metadata including database, table, and schema definitions Secure your data during access control, encryption, auditing, and networking Monitor AWS Glue jobs to detect delays and loss of data Integrate Spark ML and SageMaker with AWS Glue to create machine learning models Who this book is for ETL developers, data engineers, and data analysts




Serverless Analytics with Amazon Athena


Book Description

Get more from your data with Amazon Athena's ease-of-use, interactive performance, and pay-per-query pricing Key FeaturesExplore the promising capabilities of Amazon Athena and Athena's Query Federation SDKUse Athena to prepare data for common machine learning activitiesCover best practices for setting up connectivity between your application and Athena and security considerationsBook Description Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL, without needing to manage any infrastructure. This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 Data Lake using Athena, including how to structure your tables using open-source file formats like Parquet. You'll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you'll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving on, you'll work through easy integrations, troubleshooting and tuning common Athena issues, and the most common reasons for query failure. You will also review tips to help diagnose and correct failing queries in your pursuit of operational excellence. Finally, you'll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server. By the end of this book, you'll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today's ML modeling exercises. What you will learnSecure and manage the cost of querying your dataUse Athena ML and User Defined Functions (UDFs) to add advanced features to your reportsWrite your own Athena Connector to integrate with a custom data sourceDiscover your datasets on S3 using AWS Glue CrawlersIntegrate Amazon Athena into your applicationsSetup Identity and Access Management (IAM) policies to limit access to tables and databases in Glue Data CatalogAdd an Amazon SageMaker Notebook to your Athena queriesGet to grips with using Athena for ETL pipelinesWho this book is for Business intelligence (BI) analysts, application developers, and system administrators who are looking to generate insights from an ever-growing sea of data while controlling costs and limiting operational burden, will find this book helpful. Basic SQL knowledge is expected to make the most out of this book.




Building ETL Pipelines with Python


Book Description

Develop production-ready ETL pipelines by leveraging Python libraries and deploying them for suitable use cases Key Features Understand how to set up a Python virtual environment with PyCharm Learn functional and object-oriented approaches to create ETL pipelines Create robust CI/CD processes for ETL pipelines Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionModern extract, transform, and load (ETL) pipelines for data engineering have favored the Python language for its broad range of uses and a large assortment of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as the undisputed choice for data processing. In this book, you’ll walk through the end-to-end process of ETL data pipeline development, starting with an introduction to the fundamentals of data pipelines and establishing a Python development environment to create pipelines. Once you've explored the ETL pipeline design principles and ET development process, you'll be equipped to design custom ETL pipelines. Next, you'll get to grips with the steps in the ETL process, which involves extracting valuable data; performing transformations, through cleaning, manipulation, and ensuring data integrity; and ultimately loading the processed data into storage systems. You’ll also review several ETL modules in Python, comparing their pros and cons when building data pipelines and leveraging cloud tools, such as AWS, to create scalable data pipelines. Lastly, you’ll learn about the concept of test-driven development for ETL pipelines to ensure safe deployments. By the end of this book, you’ll have worked on several hands-on examples to create high-performance ETL pipelines to develop robust, scalable, and resilient environments using Python.What you will learn Explore the available libraries and tools to create ETL pipelines using Python Write clean and resilient ETL code in Python that can be extended and easily scaled Understand the best practices and design principles for creating ETL pipelines Orchestrate the ETL process and scale the ETL pipeline effectively Discover tools and services available in AWS for ETL pipelines Understand different testing strategies and implement them with the ETL process Who this book is for If you are a data engineer or software professional looking to create enterprise-level ETL pipelines using Python, this book is for you. Fundamental knowledge of Python is a prerequisite.




Geospatial Data Analytics on AWS


Book Description

Build an end-to-end geospatial data lake in AWS using popular AWS services such as RDS, Redshift, DynamoDB, and Athena to manage geodata Purchase of the print or Kindle book includes a free PDF eBook. Key Features Explore the architecture and different use cases to build and manage geospatial data lakes in AWS Discover how to leverage AWS purpose-built databases to store and analyze geospatial data Learn how to recognize which anti-patterns to avoid when managing geospatial data in the cloud Book DescriptionManaging geospatial data and building location-based applications in the cloud can be a daunting task. This comprehensive guide helps you overcome this challenge by presenting the concept of working with geospatial data in the cloud in an easy-to-understand way, along with teaching you how to design and build data lake architecture in AWS for geospatial data. You’ll begin by exploring the use of AWS databases like Redshift and Aurora PostgreSQL for storing and analyzing geospatial data. Next, you’ll leverage services such as DynamoDB and Athena, which offer powerful built-in geospatial functions for indexing and querying geospatial data. The book is filled with practical examples to illustrate the benefits of managing geospatial data in the cloud. As you advance, you’ll discover how to analyze and visualize data using Python and R, and utilize QuickSight to share derived insights. The concluding chapters explore the integration of commonly used platforms like Open Data on AWS, OpenStreetMap, and ArcGIS with AWS to enable you to optimize efficiency and provide a supportive community for continuous learning. By the end of this book, you’ll have the necessary tools and expertise to build and manage your own geospatial data lake on AWS, along with the knowledge needed to tackle geospatial data management challenges and make the most of AWS services.What you will learn Discover how to optimize the cloud to store your geospatial data Explore management strategies for your data repository using AWS Single Sign-On and IAM Create effective SQL queries against your geospatial data using Athena Validate postal addresses using Amazon Location services Process structured and unstructured geospatial data efficiently using R Use Amazon SageMaker to enable machine learning features in your application Explore the free and subscription satellite imagery data available for use in your GIS Who this book is forIf you understand the importance of accurate coordinates, but not necessarily the cloud, then this book is for you. This book is best suited for GIS developers, GIS analysts, data analysts, and data scientists looking to enhance their solutions with geospatial data for cloud-centric applications. A basic understanding of geographic concepts is suggested, but no experience with the cloud is necessary for understanding the concepts in this book.




AWS Certified Data Analytics Study Guide with Online Labs


Book Description

Virtual, hands-on learning labs allow you to apply your technical skills in realistic environments. So Sybex has bundled AWS labs from XtremeLabs with our popular AWS Certified Data Analytics Study Guide to give you the same experience working in these labs as you prepare for the Certified Data Analytics Exam that you would face in a real-life application. These labs in addition to the book are a proven way to prepare for the certification and for work as an AWS Data Analyst. AWS Certified Data Analytics Study Guide: Specialty (DAS-C01) Exam is intended for individuals who perform in a data analytics-focused role. This UPDATED exam validates an examinee's comprehensive understanding of using AWS services to design, build, secure, and maintain analytics solutions that provide insight from data. It assesses an examinee's ability to define AWS data analytics services and understand how they integrate with each other; and explain how AWS data analytics services fit in the data lifecycle of collection, storage, processing, and visualization. The book focuses on the following domains: • Collection • Storage and Data Management • Processing • Analysis and Visualization • Data Security This is your opportunity to take the next step in your career by expanding and validating your skills on the AWS cloud. AWS is the frontrunner in cloud computing products and services, and the AWS Certified Data Analytics Study Guide: Specialty exam will get you fully prepared through expert content, and real-world knowledge, key exam essentials, chapter review questions, and much more. Written by an AWS subject-matter expert, this study guide covers exam concepts, and provides key review on exam topics. Readers will also have access to Sybex's superior online interactive learning environment and test bank, including chapter tests, practice exams, a glossary of key terms, and electronic flashcards. And included with this version of the book, XtremeLabs virtual labs that run from your browser. The registration code is included with the book and gives you 6 months of unlimited access to XtremeLabs AWS Certified Data Analytics Labs with 3 unique lab modules based on the book.




AWS certification guide - AWS Certified Data Analytics - Specialty


Book Description

AWS Certification Guide - AWS Certified Data Analytics – Specialty Unlock the Power of AWS Data Analytics Dive into the evolving world of AWS data analytics with this comprehensive guide, tailored for those pursuing the AWS Certified Data Analytics – Specialty certification. This book is an essential resource for professionals seeking to validate their expertise in extracting meaningful insights from data using AWS analytics services. Inside, You'll Discover: Comprehensive Analytics Concepts: Thorough exploration of AWS data analytics services and tools, including Kinesis, Redshift, Glue, and more. Real-World Scenarios: Practical examples and case studies that demonstrate how to effectively use AWS services for data analysis, processing, and visualization. Targeted Exam Preparation: Insights into the certification exam format, with chapters aligned to the exam domains, complete with detailed explanations and practice questions. Latest Trends and Best Practices: Up-to-date information on the newest AWS features and data analytics best practices, ensuring your skills remain at the cutting edge. Authored by a Data Analytics Expert Written by a professional with extensive experience in AWS data analytics, this guide melds practical application with theoretical knowledge, providing a rich learning experience. Your Comprehensive Analytics Resource Whether you are deepening your existing skills or embarking on a new specialty in data analytics, this book is your definitive companion, offering a deep dive into AWS analytics services and preparing you for the Specialty certification exam. Advance Your Data Analytics Career Go beyond the fundamentals and master the complexities of AWS data analytics. This guide is not just about passing the exam; it's about developing expertise that can be applied in real-world scenarios, propelling your career forward in this exciting domain. Start Your Specialized Analytics Journey Today Embark on your path to becoming an AWS Certified Data Analytics specialist. This guide is your first step towards mastering AWS analytics and unlocking new career opportunities in the field of data. © 2023 Cybellium Ltd. All rights reserved. www.cybellium.com




Data Engineering with AWS


Book Description

Looking to revolutionize your data transformation game with AWS? Look no further! From strong foundations to hands-on building of data engineering pipelines, our expert-led manual has got you covered. Key Features Delve into robust AWS tools for ingesting, transforming, and consuming data, and for orchestrating pipelines Stay up to date with a comprehensive revised chapter on Data Governance Build modern data platforms with a new section covering transactional data lakes and data mesh Book DescriptionThis book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms which covers; implementing a data mesh approach, open-table formats (such as Apache Iceberg), and using DataOps for automation and observability. You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You’ll learn how to ensure strong data governance, and about populating data marts and data warehouses along with how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. Then, you'll explore how the power of machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS. By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro!What you will learn Seamlessly ingest streaming data with Amazon Kinesis Data Firehose Optimize, denormalize, and join datasets with AWS Glue Studio Use Amazon S3 events to trigger a Lambda process to transform a file Load data into a Redshift data warehouse and run queries with ease Visualize and explore data using Amazon QuickSight Extract sentiment data from a dataset using Amazon Comprehend Build transactional data lakes using Apache Iceberg with Amazon Athena Learn how a data mesh approach can be implemented on AWS Who this book is forThis book is for data engineers, data analysts, and data architects who are new to AWS and looking to extend their skills to the AWS cloud. Anyone new to data engineering who wants to learn about the foundational concepts, while gaining practical experience with common data engineering services on AWS, will also find this book useful. A basic understanding of big data-related topics and Python coding will help you get the most out of this book, but it’s not a prerequisite. Familiarity with the AWS console and core services will also help you follow along.




Simplify Big Data Analytics with Amazon EMR


Book Description

Design scalable big data solutions using Hadoop, Spark, and AWS cloud native services Key FeaturesBuild data pipelines that require distributed processing capabilities on a large volume of dataDiscover the security features of EMR such as data protection and granular permission managementExplore best practices and optimization techniques for building data analytics solutions in Amazon EMRBook Description Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS. This book is a practical guide to Amazon EMR for building data pipelines. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in S3 Data Lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and strategize on-premises Hadoop cluster migration to EMR. In addition to this, you'll explore best practices and cost optimization techniques while implementing your data analytics pipeline in EMR. By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS. What you will learnExplore Amazon EMR features, architecture, Hadoop interfaces, and EMR StudioConfigure, deploy, and orchestrate Hadoop or Spark jobs in productionImplement the security, data governance, and monitoring capabilities of EMRBuild applications for batch and real-time streaming data analytics solutionsPerform interactive development with a persistent EMR cluster and NotebookOrchestrate an EMR Spark job using AWS Step Functions and Apache AirflowWho this book is for This book is for data engineers, data analysts, data scientists, and solution architects who are interested in building data analytics solutions with the Hadoop ecosystem services and Amazon EMR. Prior experience in either Python programming, Scala, or the Java programming language and a basic understanding of Hadoop and AWS will help you make the most out of this book.




Learning AWS


Book Description

Discover techniques and tools for building serverless applications with AWS Key Features Get well-versed with building and deploying serverless APIs with microservices Learn to build distributed applications and microservices with AWS Step Functions A step-by-step guide that will get you up and running with building and managing applications on the AWS platform Book Description Amazon Web Services (AWS) is the most popular and widely-used cloud platform. Administering and deploying application on AWS makes the applications resilient and robust. The main focus of the book is to cover the basic concepts of cloud-based development followed by running solutions in AWS Cloud, which will help the solutions run at scale. This book not only guides you through the trade-offs and ideas behind efficient cloud applications, but is a comprehensive guide to getting the most out of AWS. In the first section, you will begin by looking at the key concepts of AWS, setting up your AWS account, and operating it. This guide also covers cloud service models, which will help you build highly scalable and secure applications on the AWS platform. We will then dive deep into concepts of cloud computing with S3 storage, RDS and EC2. Next, this book will walk you through VPC, building realtime serverless environments, and deploying serverless APIs with microservices. Finally, this book will teach you to monitor your applications, and automate your infrastructure and deploy with CloudFormation. By the end of this book, you will be well-versed with the various services that AWS provides and will be able to leverage AWS infrastructure to accelerate the development process. What you will learn Set up your AWS account and get started with the basic concepts of AWS Learn about AWS terminology and identity access management Acquaint yourself with important elements of the cloud with features such as computing, ELB, and VPC Back up your database and ensure high availability by having an understanding of database-related services in the AWS cloud Integrate AWS services with your application to meet and exceed non-functional requirements Create and automate infrastructure to design cost-effective, highly available applications Who this book is for If you are an I.T. professional or a system architect who wants to improve infrastructure using AWS, then this book is for you. It is also for programmers who are new to AWS and want to build highly efficient, scalable applications.