
QUESTION 6

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?

Correct Answer: A
The suggestion not to one-hot encode categorical feature variables within the feature repository is justified because one-hot encoding can be problematic for some machine learning algorithms. Specifically, one-hot encoding increases the dimensionality of the data, which can be computationally expensive and may lead to issues such as multicollinearity and overfitting. Additionally, some algorithms, such as tree-based methods, can handle categorical variables directly without requiring one-hot encoding.
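The dimensionality blow-up is easy to see without any Spark machinery. A minimal plain-Python sketch (illustrative only, not the feature-store or Spark ML API): a single categorical column with k distinct values turns into k binary columns.

```python
# Sketch of one-hot encoding for a single categorical column.
# One column becomes len(categories) binary columns, which is the
# dimensionality increase the explanation above warns about.

def one_hot_encode(values):
    """Return (categories, encoded rows) for one categorical column."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "blue", "green", "red", "yellow"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'red', 'yellow']
print(encoded[0])  # 'red' -> [0, 0, 1, 0]
```

With dozens of high-cardinality categorical features, this expansion multiplies quickly, which is why tree-based learners that consume categories directly can be a better fit.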
References:
✑ Databricks documentation on feature engineering: Feature Engineering

QUESTION 7

Which of the following machine learning algorithms typically uses bagging?

Correct Answer: C
Random Forest is a machine learning algorithm that typically uses bagging (bootstrap aggregating). Bagging trains multiple base models (such as decision trees) on different subsets of the data and then combines their predictions to improve overall model performance. Each subset is created by randomly sampling with replacement from the original dataset. The Random Forest algorithm builds multiple decision trees and merges their predictions to obtain a more accurate and stable result.
References:
✑ Databricks documentation on Random Forest: Random Forest in Spark ML
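The bagging procedure described above can be sketched in plain Python. The base "model" here is deliberately trivial (it just remembers the majority label of its bootstrap sample); a real Random Forest fits a decision tree per sample, but the bootstrap-then-vote structure is the same.

```python
import random
from collections import Counter

def fit_majority_model(sample):
    """Toy base learner: remember the most common label in the sample."""
    majority_label = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority_label

def bagging_fit_predict(data, x, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn WITH replacement.
        sample = [rng.choice(data) for _ in data]
        model = fit_majority_model(sample)
        votes.append(model(x))
    # Aggregate: majority vote across the ensemble.
    return Counter(votes).most_common(1)[0][0]

data = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "b")]
print(bagging_fit_predict(data, x=3))  # ensemble majority vote
```

Random Forest adds one more trick on top of bagging: each tree also considers only a random subset of features at each split, which further decorrelates the ensemble members.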

QUESTION 8

Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?

Correct Answer: E
The correct method to randomly split a Spark DataFrame into training and test sets is randomSplit. This method takes the proportions for the split as a list of weights and returns one DataFrame per weight. It is directly intended for splitting DataFrames randomly and is the appropriate choice for preparing data for training and testing in machine learning workflows.
References:
✑ Apache Spark DataFrame API documentation (DataFrame Operations: randomSplit).
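In PySpark the call looks like train_df, test_df = df.randomSplit([0.8, 0.2], seed=42). The underlying idea can be sketched in plain Python (a simplification for illustration, not Spark's actual per-partition implementation):

```python
import random

def random_split(rows, weights, seed=None):
    """Mimic the semantics of Spark's DataFrame.randomSplit: weights are
    normalized, and each row is independently assigned to one bucket."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. [0.8, 0.2] -> [0.8, 1.0]
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    buckets = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r < b:
                buckets[i].append(row)
                break
        else:
            buckets[-1].append(row)  # guard against float rounding
    return buckets

train, test = random_split(list(range(1000)), [0.8, 0.2], seed=42)
print(len(train), len(test))  # roughly 800 / 200
```

Note that, as in Spark, the resulting sizes are only approximately proportional to the weights, because each row is assigned by an independent random draw.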

QUESTION 9

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.
They have written the following incomplete code block:
[Exhibit: code block not reproduced]
Which of the following pieces of code can be used to fill in the above blank to complete the task?

Correct Answer: B
The mapInPandas function in the PySpark DataFrame API applies a function to each partition of a DataFrame. When working with grouped data, however, groupby followed by applyInPandas is the correct approach: it applies a function to each group as a separate pandas DataFrame. Since the code snippet uses groupby, the intent is to apply train_model to each group specifically, which is exactly what applyInPandas does. Thus applyInPandas is the right choice to ensure that each group produced by groupby is processed by the train_model function, preserving the partitioning and grouping integrity.
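The per-group pattern that applyInPandas implements can be sketched without Spark. The group column and the toy train_model below are illustrative stand-ins (the exam exhibit is not reproduced), but the shape is the same: one train_model call per group, one result row per group.

```python
from collections import defaultdict

def train_model(group_key, rows):
    """Toy per-group 'model': here just the mean of the y values."""
    ys = [y for _, y in rows]
    return {"group": group_key, "coef": sum(ys) / len(ys)}

def apply_per_group(rows, key_index=0):
    # Partition rows by key, like groupby(...) in Spark.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    # One train_model call per group, exactly as applyInPandas invokes
    # the user function once per grouped pandas DataFrame.
    return [train_model(k, v) for k, v in sorted(groups.items())]

data = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
print(apply_per_group(data))
# [{'group': 'a', 'coef': 2.0}, {'group': 'b', 'coef': 10.0}]
```

In real PySpark code the equivalent call has the shape df.groupby(&lt;group column&gt;).applyInPandas(train_model, schema=&lt;output schema&gt;), where the output schema describes the columns that train_model returns.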
References
✑ PySpark documentation on applying functions to grouped data: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html

QUESTION 10

A data scientist is working with a feature set with the following schema:
[Exhibit: feature set schema not reproduced]
The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.
Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?

Correct Answer: B
For the feature set schema provided, the columns that need to be imputed using the most common value (the mode) are typically the categorical columns. In this case, loyalty_tier is the only categorical column that should be imputed using the most common value. customer_id is a unique identifier and should not be imputed, while spend and units are numerical columns that should typically be imputed using the mean or median, not the mode.
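The two imputation rules can be sketched with the standard library. The column names and sample values below mirror the ones discussed (loyalty_tier as categorical, spend as numeric) but are invented for illustration:

```python
from statistics import mean, mode

def impute_mode(values):
    """Fill missing (None) entries with the most common present value."""
    present = [v for v in values if v is not None]
    fill = mode(present)
    return [fill if v is None else v for v in values]

def impute_mean(values):
    """Fill missing (None) entries with the mean of the present values."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]

loyalty_tier = ["gold", "silver", None, "gold"]  # categorical -> mode
spend = [10.0, None, 30.0, 20.0]                 # numeric -> mean

print(impute_mode(loyalty_tier))  # ['gold', 'silver', 'gold', 'gold']
print(impute_mean(spend))         # [10.0, 20.0, 30.0, 20.0]
```

Mode imputation works for categories because the mean or median of string labels is undefined; for numeric columns the mean or median preserves the distribution's center better than the mode.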
References:
✑ Databricks documentation on missing value imputation: Handling Missing Data