Black Friday Special Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: 70special

Databricks Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Exam Practice Test

Databricks Certified Data Engineer Professional Exam Questions and Answers

Testing Engine

  • Product Type: Testing Engine
$37.5  $124.99

PDF Study Guide

  • Product Type: PDF Study Guide
$33  $109.99
Question 1

A Data engineer wants to run unit’s tests using common Python testing frameworks on python functions defined across several Databricks notebooks currently used in production.

How can the data engineer run unit tests against function that work with data in production?

Options:

A.

Run unit tests against non-production data that closely mirrors production

B.

Define and unit test functions using Files in Repos

C.

Define units test and functions within the same notebook

D.

Define and import unit test functions from a separate Databricks notebook

Question 2

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

Options:

A.

The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.

B.

A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.

C.

The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.

D.

An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.

E.

An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.

Question 3

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

Options:

A.

Workspace audit logs

B.

Driver's log file

C.

Ganglia

D.

Cluster Event Log

E.

Executor's log file

Question 4

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Which statement describes the execution and results of running the above query multiple times?

Options:

A.

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

B.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

C.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

D.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

E.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.

Question 5

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

Options:

A.

Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.

B.

Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.

C.

Storinq all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.

D.

Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

E.

Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.

Question 6

A Delta Lake table was created with the below query:

Realizing that the original query had a typographical error, the below code was executed:

ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store

Which result will occur after running the second command?

Options:

A.

The table reference in the metastore is updated and no data is changed.

B.

The table name change is recorded in the Delta transaction log.

C.

All related files and metadata are dropped and recreated in a single ACID transaction.

D.

The table reference in the metastore is updated and all data files are moved.

E.

A new Delta transaction log Is created for the renamed table.

Question 7

Review the following error traceback:

Which statement describes the error being raised?

Options:

A.

The code executed was PvSoark but was executed in a Scala notebook.

B.

There is no column in the table named heartrateheartrateheartrate

C.

There is a type error because a column object cannot be multiplied.

D.

There is a type error because a DataFrame object cannot be multiplied.

E.

There is a syntax error because the heartrate column is not correctly identified as a column.

Question 8

A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

Options:

A.

They are merged.

B.

They are ignored.

C.

They are updated.

D.

They are inserted.

E.

They are deleted.

Question 9

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs Ul. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Options:

A.

Can manage

B.

Can edit

C.

Can run

D.

Can Read

Question 10

Which is a key benefit of an end-to-end test?

Options:

A.

It closely simulates real world usage of your application.

B.

It pinpoint errors in the building blocks of your application.

C.

It provides testing coverage for all code paths and branches.

D.

It makes it easier to automate your test suite

Question 11

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?

Options:

A.

Create a separate history table for each pk_id resolve the current state of the table by running a union all filtering the history tables for the most recent state.

B.

Use merge into to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.

C.

Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.

D.

Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.

E.

Ingest all log information into a bronze table; use merge into to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.

Question 12

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

to_interval("event_time", "5 minutes").alias("time")

B.

window("event_time", "5 minutes").alias("time")

C.

"event_time"

D.

window("event_time", "10 minutes").alias("time")

E.

lag("event_time", "10 minutes").alias("time")

Question 13

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.

The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.

Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

Options:

A.

Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.

B.

Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.

C.

Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.

D.

Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.

E.

Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.

Question 14

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

Options:

A.

Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.

B.

Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.

C.

Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.

D.

Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.

E.

Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.

Question 15

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?

Options:

A.

/jobs/runs/list

B.

/jobs/runs/get-output

C.

/jobs/runs/get

D.

/jobs/get

E.

/jobs/list

Question 16

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Options:

A.

Regex

B.

Julia

C.

pyspsark.ml.feature

D.

Scala Datasets

E.

C++

Question 17

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

MERGE INTO customers

USING (

SELECT updates.customer_id as merge_ey, updates .*

FROM updates

UNION ALL

SELECT NULL as merge_key, updates .*

FROM updates JOIN customers

ON updates.customer_id = customers.customer_id

WHERE customers.current = true AND updates.address <> customers.address

) staged_updates

ON customers.customer_id = mergekey

WHEN MATCHED AND customers. current = true AND customers.address <> staged_updates.address THEN

UPDATE SET current = false, end_date = staged_updates.effective_date

WHEN NOT MATCHED THEN

INSERT (customer_id, address, current, effective_date, end_date)

VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?

Options:

A.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

B.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

C.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

D.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

Question 18

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

Options:

A.

Use Repos to make a pull request use the Databricks REST API to update the current branch to dev-2.3.9

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

Question 19

Which statement characterizes the general programming model used by Spark Structured Streaming?

Options:

A.

Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.

B.

Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.

C.

Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.

D.

Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.

E.

Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.

Question 20

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT (*) FROM table -

Which of the following describes how results are generated each time the dashboard is updated?

Options:

A.

The total count of rows is calculated by scanning all data files

B.

The total count of rows will be returned from cached results unless REFRESH is run

C.

The total count of records is calculated from the Delta transaction logs

D.

The total count of records is calculated from the parquet file metadata

E.

The total count of records is calculated from the Hive metastore

Question 21

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Options:

A.

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs. all PySpark and Spark SQL logic should be refactored.

B.

The only way to meaningfully troubleshoot code execution times in development notebooks Is to use production-sized data and production-sized clusters with Run All execution.

C.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

D.

Calling display () forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

E.

The Jobs Ul should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Question 22

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows: Note that proposed changes are in bold.

Which step must also be completed to put the proposed query into production?

Options:

A.

Increase the shuffle partitions to account for additional aggregates

B.

Specify a new checkpointlocation

C.

Run REFRESH TABLE delta, /item_agg'

D.

Remove .option (mergeSchema', true') from the streaming write

Question 23

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

Options:

A.

Stage’s detail screen and Executor’s files

B.

Stage’s detail screen and Query’s detail screen

C.

Driver’s and Executor’s log files

D.

Executor’s detail screen and Executor’s log files

Question 24

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impression led to monitizable clicks.

Which solution would improve the performance?

A)

B)

C)

D)

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Question 25

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster.

Options:

A.

"Can Manage" privileges on the required cluster

B.

Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster

C.

Cluster creation allowed. "Can Attach To" privileges on the required cluster

D.

"Can Restart" privileges on the required cluster

E.

Cluster creation allowed. "Can Restart" privileges on the required cluster

Question 26

A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.

The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

Options:

A.

No; Delta Lake manages streaming checkpoints in the transaction log.

B.

Yes; both of the streams can share a single checkpoint directory.

C.

No; only one stream can write to a Delta Lake table.

D.

Yes; Delta Lake supports infinite concurrent writers.

E.

No; each of the streams needs to have its own checkpoint directory.

Question 27

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.

If task A fails during a scheduled run, which statement describes the results of this run?

Options:

A.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

B.

Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.

C.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.

D.

Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.

E.

Tasks B and C will be skipped; task A will not commit any changes because of stage failure.

Question 28

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Options:

A.

Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.

B.

Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

C.

The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.

D.

Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.

E.

Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Question 29

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E

Question 30

The data governance team has instituted a requirement that all tables containing Personal Identifiable Information (PH) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:

Which command allows manual confirmation that these three requirements have been met?

Options:

A.

DESCRIBE EXTENDED dev.pii test

B.

DESCRIBE DETAIL dev.pii test

C.

SHOW TBLPROPERTIES dev.pii test

D.

DESCRIBE HISTORY dev.pii test

E.

SHOW TABLES dev

Question 31

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the spark UI's Storage tab to signal that a cached table is not performing optimally?

Options:

A.

Size on Disk is> 0

B.

The number of Cached Partitions> the number of Spark Partitions

C.

The RDD Block Name included the '' annotation signaling failure to cache

D.

On Heap Memory Usage is within 75% of off Heap Memory usage

Question 32

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

Options:

A.

preds.write.mode("append").saveAsTable("churn_preds")

B.

preds.write.format("delta").save("/preds/churn_preds")

C)

D)

E)

C.

Option A

D.

Option B

E.

Option C

F.

Option D

G.

Option E

Question 33

A team of data engineer are adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.

One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.

What approach would allow them to do this?

Options:

A.

Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.

B.

Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.

C.

Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.

D.

Maintain data quality rules in a separate Databricks notebook that each DLT notebook of file.

Question 34

Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?

Options:

A.

spark.sql.files.maxPartitionBytes

B.

spark.sql.autoBroadcastJoinThreshold

C.

spark.sql.files.openCostInBytes

D.

spark.sql.adaptive.coalescePartitions.minPartitionNum

E.

spark.sql.adaptive.advisoryPartitionSizeInBytes

Question 35

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Options:

A.

Can Manage

B.

Can Edit

C.

No permissions

D.

Can Read

E.

Can Run

Question 36

The Databricks CLI is use to trigger a run of an existing job by passing the job_id parameter. The response that the job run request has been submitted successfully includes a filed run_id.

Which statement describes what the number alongside this field represents?

Options:

A.

The job_id is returned in this field.

B.

The job_id and number of times the job has been are concatenated and returned.

C.

The number of times the job definition has been run in the workspace.

D.

The globally unique ID of the newly triggered run.