
QUESTION 1

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Correct Answer: C
For large datasets with many variables, Spark ML distributes the training of a linear regression model using iterative optimization methods. Specifically, Spark ML employs algorithms such as Gradient Descent or L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) to iteratively minimize the loss function. These iterative methods are suitable for distributed computing environments and can handle large-scale data efficiently by partitioning the data across nodes in a cluster and performing parallel updates.
References:
✑ Spark MLlib Documentation (Linear Regression with Iterative Optimization).
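
The map-reduce pattern behind these iterative solvers can be sketched on a single machine with NumPy. This is a toy illustration of the idea, not Spark's actual implementation: each "partition" computes a partial gradient, the partials are summed, and the coefficients are updated.

```python
import numpy as np

def partial_gradient(X, y, w):
    """Gradient of the squared loss on one data partition."""
    return X.T @ (X @ w - y)

# Toy data: y = 2*x0 + 3*x1, split into two "partitions"
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0])
partitions = [(X[:50], y[:50]), (X[50:], y[50:])]

w = np.zeros(2)
lr = 0.1
for _ in range(300):
    # "map" step: each partition computes its partial gradient
    # (on a real cluster, this happens in parallel across executors)
    grads = [partial_gradient(Xp, yp, w) for Xp, yp in partitions]
    # "reduce" step: sum the partials into the full gradient, then update
    w -= lr * sum(grads) / len(y)

print(w)  # converges to approximately [2.0, 3.0]
```

Only the coefficient vector and the summed gradients travel between the driver and the partitions, which is what keeps the approach scalable.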

QUESTION 2

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Correct Answer: C
A pandas API on Spark DataFrame is made up of a Spark DataFrame with additional metadata. The pandas API on Spark aims to provide a pandas-like experience with the scalability and distributed nature of Spark. It allows users to apply pandas-style functions to large datasets by leveraging Spark's underlying capabilities.
References:
✑ Databricks documentation on pandas API on Spark: pandas API on Spark

QUESTION 3

A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:
(exhibit image not shown)
They have written the following incomplete code block to use predict to score each record of the Spark DataFrame spark_df:
(exhibit image not shown)
Which of the following lines of code can be used to complete the code block to successfully complete the task?

Correct Answer: B
To apply the Pandas UDF predict to each record of a Spark DataFrame, you use the mapInPandas method. This method allows the Pandas UDF to operate on partitions of the DataFrame as pandas DataFrames, applying the specified function (predict in this case) to each partition. The correct code completion to execute this is simply mapInPandas(predict), which specifies the UDF to use without additional arguments or incorrect function calls.
References:
✑ PySpark DataFrame documentation (Using mapInPandas with UDFs).
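
The contract that mapInPandas expects from such a UDF can be illustrated without a cluster: the function receives an iterator of pandas DataFrames (one per batch) and yields pandas DataFrames. The predict body and the feature column below are hypothetical stand-ins, since the original exhibit is not shown.

```python
from typing import Iterator
import pandas as pd

def predict(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Hypothetical single-node "model": doubles the feature value.
    # In real code this would call something like model.predict(batch).
    for batch in batches:
        yield batch.assign(prediction=batch["feature"] * 2)

# mapInPandas feeds the UDF an iterator of per-partition pandas batches;
# we simulate that locally with a plain list of DataFrames.
batches = [pd.DataFrame({"feature": [1.0, 2.0]}),
           pd.DataFrame({"feature": [3.0]})]
result = pd.concat(predict(iter(batches)), ignore_index=True)
print(result["prediction"].tolist())  # [2.0, 4.0, 6.0]
```

On a real DataFrame, mapInPandas also takes the output schema of the returned rows, so the full call has the shape spark_df.mapInPandas(predict, schema).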

QUESTION 4

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.
In which situation will the machine learning engineer be correct?

Correct Answer: D
If the new solution requires that each of the three models computes a prediction for every record, the time efficiency during inference will be reduced. This is because the inference process now involves running multiple models instead of a single model, thereby increasing the overall computation time for each record.
In scenarios where inference must be done by multiple models for each record, the latency accumulates, making the process less time efficient compared to using a single model.
References:
✑ Model Ensemble Techniques
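
The accumulation can be sketched by counting model calls with three hypothetical models that must each score every record:

```python
# Three hypothetical models that must each score every record
models = [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2]
records = [1.0, 2.0, 3.0]

# Old solution: one prediction per record
single_calls = len(records)

# New solution: every model scores every record, so the work
# (and, with similar per-model latency, the time) triples
ensemble = [[m(r) for m in models] for r in records]
ensemble_calls = len(records) * len(models)

print(single_calls, ensemble_calls)  # 3 9
```

If instead only one of the models were selected per record, or the models ran fully in parallel on separate hardware, the latency penalty would not necessarily apply, which is why the answer hinges on every model scoring every record.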

QUESTION 5

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.
The Spark DataFrame train_df has the following schema:
(exhibit image not shown)
The machine learning engineer shares the following code block:
(exhibit image not shown)
Which of the following changes does the machine learning engineer need to make to complete the task?

Correct Answer: B
In Spark ML, the linear regression model expects the feature column to be a vector type. If the features column in the DataFrame train_df is not already in this format (for example, the features are still separate numeric columns or a non-vector type), the engineer needs to convert it to a vector column using a transformer like VectorAssembler. This is a critical step in preparing the data for modeling, as Spark ML models require the input features to be combined into a single vector column.
References:
✑ Spark MLlib documentation for LinearRegression: https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression