Offering Free access to ML Data Scientist Databricks-Machine-Learning-Associate Exam Questions Pool Bank

Databricks Certified Machine Learning Associate Exam Questions and Answers

Question 1

Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?

Options:

MLflow Experiment Tracking

Spark ML

Autoscaling clusters

Delta Lake

Answer: Spark ML

Explanation:

Spark ML (part of Apache Spark's MLlib) runs machine learning workloads across the nodes of a cluster, and its tuning utilities can evaluate hyperparameter combinations in parallel. Of the listed options it is the only one that performs model tuning: MLflow Experiment Tracking records runs, autoscaling clusters adjust compute capacity, and Delta Lake is a storage layer. This makes it the applicable choice for parallelizing hyperparameter tuning of single-node machine learning models, provided they are adapted to run on Spark.

References

Apache Spark MLlib Guide: https://spark.apache.org/docs/latest/ml-guide.html

Spark ML is the library within Apache Spark for scalable machine learning. In more detail, it supports parallel hyperparameter tuning on a Spark cluster as follows:

Hyperparameter Tuning with CrossValidator: Spark ML includes the CrossValidator and TrainValidationSplit classes for hyperparameter tuning. Both accept a parallelism parameter that sets how many hyperparameter combinations are evaluated concurrently on the cluster.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the model (any Spark ML estimator)
model = ...

# Create a parameter grid; hyperparam1, hyperparam2 and value1..value4 are placeholders
paramGrid = (ParamGridBuilder()
             .addGrid(model.hyperparam1, [value1, value2])
             .addGrid(model.hyperparam2, [value3, value4])
             .build())

# Define the evaluator
evaluator = BinaryClassificationEvaluator()

# Define the CrossValidator; parallelism sets how many models are fit concurrently (default is 1)
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3,
                          parallelism=4)

Parallel Execution: When parallelism is set above 1, Spark trains models for different hyperparameter combinations concurrently, and each training job is itself distributed across the cluster’s nodes.

Scalability: Because Spark ML builds on Spark's distributed computing, it can process large datasets and train models across many nodes, which can make hyperparameter tuning considerably faster than on a single node.
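The idea of fitting several hyperparameter combinations at once can be illustrated without a cluster. The sketch below is a standard-library analogy, not Spark ML code: a thread pool stands in for the cluster, the search space is made up, and toy_score is a hypothetical stand-in for "fit a model and return its validation score".

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical search space; names and values are illustrative only.
grid = {"max_depth": [2, 5, 10], "num_trees": [50, 100]}


def toy_score(params):
    """Stand-in for fitting a model and returning its validation score."""
    # By construction the score peaks at max_depth=5, num_trees=100.
    return -abs(params["max_depth"] - 5) - abs(params["num_trees"] - 100) / 50


# Expand the grid into one dict per hyperparameter combination.
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

# Evaluate up to 4 combinations at a time; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(toy_score, param_maps))

# Pick the combination with the highest score.
best_score, best_params = max(zip(scores, param_maps), key=lambda pair: pair[0])
print(best_params)
```

Here max_workers plays roughly the role that the parallelism setting plays in a tuning run: it caps how many combinations are in flight at once.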


Question 2

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

Options:

Spark ML decision trees test every feature variable in the splitting algorithm

Spark ML decision trees automatically prune overfit trees

Spark ML decision trees test more split candidates in the splitting algorithm

Spark ML decision trees test a random sample of feature variables in the splitting algorithm

Spark ML decision trees test binned features values as representative split candidates

Question 3

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.

Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

Options:

A holdout set is not necessary when using a train-validation split

Reproducibility is achievable when using a train-validation split

Fewer hyperparameter values need to be tested when using a train-validation split

Bias is avoidable when using a train-validation split

Fewer models need to be trained when using a train-validation split
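To make the comparison in this question concrete, the sketch below generates the train/validation index pairs for each strategy. It is a simplified, standard-library illustration that assumes contiguous, unshuffled folds; the function names are hypothetical and not taken from any library.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def train_validation_indices(n_rows, train_ratio=0.8):
    """Return a single (train_idx, val_idx) pair."""
    cut = int(n_rows * train_ratio)
    return list(range(cut)), list(range(cut, n_rows))


folds = list(k_fold_indices(10, 5))
single = train_validation_indices(10)

print(len(folds))  # number of train/validation pairs, one model fit per pair
print(single)      # the one pair used by a train-validation split
```

With k folds, every hyperparameter setting is fit k times; with a single split it is fit once, at the cost of a noisier performance estimate.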

Question 4

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Options:

One-hot encoding is not supported by most machine learning libraries.

One-hot encoding is dependent on the target variable's values which differ for each application.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
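For readers unfamiliar with the transformation under discussion, here is a minimal standard-library sketch of one-hot encoding. The one_hot helper and the sample data are hypothetical; real feature pipelines would normally use a library encoder.

```python
def one_hot(values):
    """Return (sorted categories, one-hot encoded rows) for a list of labels."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)  # one column per distinct category
        row[index[value]] = 1        # mark the column for this row's category
        rows.append(row)
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

A single column becomes one column per distinct category, so the encoded width grows with cardinality. Storing the raw categorical values leaves each downstream model free to choose the encoding that suits its algorithm.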

Question 5

Which of the following machine learning algorithms typically uses bagging?

Options:

Gradient boosted trees

K-means

Random forest

Decision tree
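Bagging (bootstrap aggregating) means training several models on bootstrap samples of the data and combining their outputs. The sketch below shows only that mechanism, using the standard library; each "model" is just a sample mean, a deliberately trivial stand-in for a real learner, and all names are hypothetical.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_mean(data, n_estimators=25, seed=0):
    """Average the estimates of several 'models', each fit on a bootstrap sample.

    Each 'model' here is simply the mean of its sample.
    """
    rng = random.Random(seed)
    estimates = [mean(bootstrap_sample(data, rng)) for _ in range(n_estimators)]
    return mean(estimates)


data = [3.0, 5.0, 7.0, 9.0, 11.0]
print(bagged_mean(data))
```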

Question 6

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.

Which of the following code blocks will accomplish this task?

Options:

spark_df[spark_df["price"] > 0]

spark_df.filter(col("price") > 0)

SELECT * FROM spark_df WHERE price > 0

spark_df.loc[spark_df["price"] > 0,:]

spark_df.loc[:,spark_df["price"] > 0]

Question 7

A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:

● Hyperparameter 1: [2, 5, 10]

● Hyperparameter 2: [50, 100]

Which of the following represents the number of machine learning models that can be trained in parallel during this process?

Options:
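For reference, the quantities involved in this question can be enumerated directly. This is a small sketch assuming the usual convention that grid search fits one model per hyperparameter combination per fold; the variable names are illustrative.

```python
from itertools import product

hyperparameter_1 = [2, 5, 10]
hyperparameter_2 = [50, 100]
num_folds = 3

# Every pairing of the two hyperparameter lists.
combinations = list(product(hyperparameter_1, hyperparameter_2))

# One fit per combination per fold (a final refit on all data would add one more).
fits_during_cv = len(combinations) * num_folds

print(len(combinations))  # 6
print(fits_during_cv)     # 18
```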

Question 8

A data scientist is using Spark ML to engineer features for an exploratory machine learning project.

They decide they want to standardize their features using the following code block:

Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set.

Which of the following changes can the data scientist make to address the concern?

Options:

Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values

Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values

Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data

Utilize the Pipeline API to standardize the training data according to the test data's summary statistics

Utilize the Pipeline API to standardize the test data according to the training data's summary statistics
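The concern raised in the code review is about which rows the summary statistics are computed from. The sketch below illustrates the principle with the standard library only; it is not the Spark ML Pipeline API, the helper names are hypothetical, and the use of the population standard deviation is an arbitrary choice for the example.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training data only."""
    return mean(train_values), pstdev(train_values)


def transform(values, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(value - mu) / sigma for value in values]


train = [10.0, 20.0, 30.0]
test = [40.0]

mu, sigma = fit_scaler(train)                # statistics come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)     # test reuses the training statistics

print(train_scaled, test_scaled)
```

Fitting the scaler on the training rows and only applying it to the test rows keeps information about the test set out of the preprocessing step.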

Question 9

A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.

Which of the following classification metrics should be used to evaluate the model?

Options:

RMSE

Precision

Area under the residual operating curve

Accuracy

Recall
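As a reminder of how two of these metrics are defined, the sketch below computes precision and recall from confusion-matrix counts. The counts are hypothetical and chosen only to make the two values differ.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn)


# Hypothetical counts for an infection classifier.
tp, fp, fn = 80, 40, 20

print(precision(tp, fp))
print(recall(tp, fn))
```

Recall falls whenever a true case is missed (a false negative), while precision falls whenever a healthy patient is flagged (a false positive).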

Question 10

A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML.

Which of the following compute tools is best suited for this use case?

Options:

Single Node cluster

Standard cluster

SQL Warehouse

None of these compute tools support this task

Question 11

Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?

Options:

R-squared

MAE

MSE

Question 12

A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.

As a result, they have the following code block:

Which of the following changes do they need to make to the above code block in order to accomplish the task?

Options:

Change SparkTrials() to Trials()

Reduce num_evals to be less than 10

Change fmin() to fmax()

Remove the trials=trials argument

Remove the algo=tpe.suggest argument

Question 13

A new data scientist has started working on an existing machine learning project. The project is a scheduled Job that retrains every day. The project currently exists in a Repo in Databricks. The data scientist has been tasked with improving the feature engineering of the pipeline’s preprocessing stage. The data scientist wants to make necessary updates to the code that can be easily adopted into the project without changing what is being run each day.

Which approach should the data scientist take to complete this task?

Options:

They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.

They can clone the notebooks in the repository into a Databricks Workspace folder and make the necessary changes.

They can create a new Git repository, import it into Databricks, and copy and paste the existing code from the original repository before making changes.

They can clone the notebooks in the repository into a new Databricks Repo and make the necessary changes.

Question 14

A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime increases drastically and processing becomes slow.

Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

Options:

PySpark DataFrame API

pandas API on Spark

Spark SQL

Feature Store

Question 15

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

Options:

When the new solution requires if-else logic determining which model to use to compute each prediction

When the new solution's models have an average latency that is larger than the size of the original model

When the new solution requires the use of fewer feature variables than the original model

When the new solution requires that each model computes a prediction for every record

When the new solution's models have an average size that is larger than the size of the original model

Question 16

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

spark_df.to_sql()

import pandas as pd

df = pd.DataFrame(spark_df)

spark_df.to_pandas()

Question 17

Which statement describes a Spark ML transformer?

Options:

A transformer is an algorithm which can transform one DataFrame into another DataFrame

A transformer is a hyperparameter grid that can be used to train a model

A transformer chains multiple algorithms together to transform an ML workflow

A transformer is a learning algorithm that can use a DataFrame to train a model

Question 18

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

spark_df.to_pandas()

import pandas as pd

df = pd.DataFrame(spark_df)

Question 19

Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?

Options:

Random Search

Halving Random Search

Tree of Parzen Estimators

Grid Search

Question 20

A data scientist is wanting to explore the Spark DataFrame spark_df. The data scientist wants visual histograms displaying the distribution of numeric features to be included in the exploration.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

spark_df.describe()

dbutils.data(spark_df).summarize()

This task cannot be accomplished in a single line of code.

spark_df.summary()

dbutils.data.summarize(spark_df)

Question 21

A data scientist is attempting to tune a logistic regression model logistic using scikit-learn. They want to specify a search space for two hyperparameters and let the tuning process randomly select values for each evaluation.

They attempt to run the following code block, but it does not accomplish the desired task:

Which of the following changes can the data scientist make to accomplish the task?

Options:

Replace the GridSearchCV operation with RandomizedSearchCV

Replace the GridSearchCV operation with cross_validate

Replace the GridSearchCV operation with ParameterGrid

Replace the random_state=0 argument with random_state=1

Replace the penalty=['l2', 'l1'] argument with penalty=uniform('l2', 'l1')
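The difference between exhaustive and randomized search over the same space can be sketched with the standard library. This is a simplified illustration rather than scikit-learn code: the search space is hypothetical, and this random sampler may repeat a combination.

```python
import random
from itertools import product

# Hypothetical search space for illustration.
space = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2", "l1"]}


def grid_candidates(space):
    """Every combination, as an exhaustive grid search would evaluate."""
    return [dict(zip(space, combo)) for combo in product(*space.values())]


def random_candidates(space, n_iter, seed=0):
    """n_iter independently sampled combinations, as a random search would evaluate."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


print(len(grid_candidates(space)))        # 8
print(random_candidates(space, n_iter=3))
```

The grid always evaluates all 8 combinations, whereas the randomized variant evaluates only as many as requested.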

Question 22

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model in parallel. They elect to use the Hyperopt library to facilitate this process.

Which of the following Hyperopt tools provides the ability to optimize hyperparameters in parallel?

Options:

fmin

SparkTrials

quniform

search_space

objective_function
