2026 New Associate-Developer-Apache-Spark-3.5 Dumps - Real Databricks Exam Questions [Q27-Q47]

Dependable Associate-Developer-Apache-Spark-3.5 Exam Dumps to Become Databricks Certified

NEW QUESTION # 27
Which Spark configuration controls the number of tasks that can run in parallel on the executor?
Options:

  • A. spark.executor.cores
  • B. spark.driver.cores
  • C. spark.executor.memory
  • D. spark.task.maxFailures

Answer: A

Explanation:
spark.executor.cores determines how many concurrent tasks an executor can run.
For example, if set to 4, each executor can run up to 4 tasks in parallel.
Other settings:
spark.task.maxFailures controls task retry logic.
spark.driver.cores is for the driver, not executors.
spark.executor.memory sets memory limits, not task concurrency.
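As a rough sketch of the arithmetic behind this setting: the number of task slots on one executor is spark.executor.cores divided by spark.task.cpus (which defaults to 1). The helper name task_slots and the sample values are illustrative assumptions, not part of Spark's API.

```python
def task_slots(executor_cores: int, task_cpus: int = 1, num_executors: int = 1) -> int:
    """Estimate how many tasks can run at the same time.

    executor_cores mirrors spark.executor.cores and task_cpus mirrors
    spark.task.cpus (default 1). The function itself is a hypothetical helper.
    """
    if executor_cores < 1 or task_cpus < 1 or num_executors < 1:
        raise ValueError("all arguments must be positive")
    # Each task reserves task_cpus cores, so an executor offers this many slots.
    slots_per_executor = executor_cores // task_cpus
    return slots_per_executor * num_executors


print(task_slots(4))                                # 4
print(task_slots(4, task_cpus=2, num_executors=3))  # 6
```

With the default spark.task.cpus of 1, the slot count per executor equals spark.executor.cores, which is the case the question describes.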


NEW QUESTION # 28
What is the risk associated with converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?

  • A. The operation will fail if the Pandas DataFrame exceeds 1000 rows
  • B. The conversion will automatically distribute the data across worker nodes
  • C. Data will be lost during conversion
  • D. The operation will load all data into the driver's memory, potentially causing memory overflow

Answer: D

Explanation:
When you convert a large pyspark.pandas (aka Pandas API on Spark) DataFrame to a local Pandas DataFrame using .to_pandas() (or .toPandas() on a regular Spark DataFrame), Spark collects all partitions to the driver.
From the Spark documentation:
"Be careful when converting large datasets to Pandas. The entire dataset will be pulled into the driver's memory."
Thus, for large datasets, this can cause memory overflow or out-of-memory errors on the driver.
Final Answer: D


NEW QUESTION # 29
A data scientist of an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns in this DataFrame. The PII columns in df_user are first_name, last_name, email, and birthdate.
Which code snippet can be used to meet this requirement?

  • A. df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
  • B. df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")
  • C. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
  • D. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

Answer: C

Explanation:
To remove specific columns from a PySpark DataFrame, the drop() method is used. This method returns a new DataFrame without the specified columns. To drop multiple columns, pass each column name as a separate argument to drop().
Correct Usage:
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
This line of code returns a new DataFrame df_user_non_pii that excludes the specified PII columns.
Explanation of Options:
A . Incorrect. The DataFrame class has no dropfields() method, and a single comma-separated string would not name four separate columns in any case.
B . Incorrect. dropfields() is not a method of the DataFrame class in PySpark. The similarly named Column.dropFields() drops fields from nested StructType columns, not top-level DataFrame columns.
C . Correct. Uses the drop() method with multiple column names passed as separate arguments, which is the standard usage in PySpark.
D . As written, this option is identical to Option C and would also work; the answer key lists C.
Reference:
PySpark Documentation: DataFrame.drop
Stack Overflow Discussion: How to delete columns in PySpark DataFrame


NEW QUESTION # 30
A developer wants to refactor some older Spark code to leverage built-in functions introduced in Spark 3.5.0.
The existing code performs array manipulations manually. Which of the following code snippets utilizes new built-in functions in Spark 3.5.0 for array operations?

  • A. result_df = prices_df \
    .agg(F.count("spot_price").alias("spot_price")) \
    .filter(F.col("spot_price") > F.lit("min_price"))
  • B. result_df = prices_df \
    .agg(F.min("spot_price"), F.max("spot_price"))
  • C. result_df = prices_df \
    .withColumn("valid_price", F.when(F.col("spot_price") > F.lit(min_price), 1).otherwise(0))
  • D. result_df = prices_df \
    .agg(F.count_if(F.col("spot_price") >= F.lit(min_price)))

Answer: D

Explanation:
The correct answer is D because it uses count_if, which was added to the PySpark functions API in Spark 3.5.0 and simplifies conditional counting within aggregations.
* F.count_if(condition) counts the number of rows that meet the specified boolean condition.
* In this example, it directly counts how many times spot_price >= min_price evaluates to true, replacing the older verbose combination of when/otherwise and filtering or summing.
Official Spark 3.5.0 documentation notes the addition of count_if to simplify this kind of logic:
"Added count_if aggregate function to count only the rows where a boolean condition holds (SPARK-43773)."
Why other options are incorrect or outdated:
* A counts every non-null spot_price before filtering, so the condition is never applied per row, and it compares against the string literal "min_price" rather than the variable.
* B performs a simple min/max aggregation, which is useful but unrelated to conditional counting or the new functionality.
* C uses a legacy-style method of adding a flag column (when().otherwise()), which is verbose compared to count_if.
Therefore, D is the only option that uses new functionality from Spark 3.5.0 correctly and efficiently.
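To make the semantics concrete, here is a plain-Python model of what a conditional count computes. It does not call Spark; the price values and the min_price threshold are made-up sample data, and None stands in for SQL NULL.

```python
from typing import Iterable, Optional


def count_if(values: Iterable[Optional[float]], min_price: float) -> int:
    """Count values where value >= min_price holds.

    None models SQL NULL: the comparison is not true, so it is not counted.
    """
    return sum(1 for v in values if v is not None and v >= min_price)


spot_prices = [9.5, 12.0, None, 15.25, 10.0]  # hypothetical sample data
print(count_if(spot_prices, min_price=10.0))  # 3
```

In PySpark 3.5 the same result comes from a single aggregation, prices_df.agg(F.count_if(F.col("spot_price") >= F.lit(min_price))), with no helper column or separate filter step.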


NEW QUESTION # 31
A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.
Which approach should the data scientist use to deduplicate the transactions using PySpark?

  • A. df = df.dropDuplicates(["transaction_amount"])
  • B. df = df.dropDuplicates()
  • C. df = df.filter(F.col("transaction_id").isNotNull())
  • D. df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

Answer: B

Explanation:
dropDuplicates() with no column list removes duplicates based on all columns.
It's the most efficient and semantically correct way to deduplicate records that are completely identical across all fields.
From the PySpark documentation:
dropDuplicates(): Return a new DataFrame with duplicate rows removed, considering all columns if none are specified.
- Source: PySpark DataFrame.dropDuplicates() API
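The difference between deduplicating on all columns and on a single column can be illustrated with a small plain-Python model. This is only a sketch of the semantics, not Spark code; the Txn type, the drop_duplicates helper, and the sample rows are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Txn(NamedTuple):
    transaction_id: int
    account_number: str
    transaction_amount: float
    timestamp: str


def drop_duplicates(rows: List[Txn]) -> List[Txn]:
    """Keep the first occurrence of each fully identical row."""
    # Tuples hash on every field, and dict keys are unique and insertion-ordered.
    return list(dict.fromkeys(rows))


rows = [
    Txn(1, "ACC-1", 50.0, "2024-01-01T10:00:00"),
    Txn(1, "ACC-1", 50.0, "2024-01-01T10:00:00"),  # exact duplicate
    Txn(2, "ACC-2", 50.0, "2024-01-01T10:05:00"),  # same amount, different row
]
deduped = drop_duplicates(rows)
print(len(deduped))  # 2
```

Deduplicating on transaction_amount alone would also discard the third row, which is a distinct transaction. Unlike this model, Spark does not promise which copy survives or in what order rows come back.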


NEW QUESTION # 32
A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.
Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

  • A. df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")
  • B. df.withColumn("discount", df.purchase_amount * 0.1).select("discount")
  • C. df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)
  • D. df.withColumn("purchase_date", current_date()).where("total_purchase > 50")

Answer: C

Explanation:
Shuffling occurs in wide operations such as groupBy, reduceByKey, or join, which move data across partitions.
In Option C, the groupBy followed by agg results in a shuffle because rows with the same user_id must be brought together.
Note that repartition(10) is not a narrow transformation: it redistributes the data and normally triggers a shuffle of its own. The answer key nevertheless treats C as the intended answer, since it is the only option whose sequence starts with a shuffle-inducing transformation.
Option A does the opposite: the filter does not cause a shuffle, but the groupBy that follows does, so the order is reversed.
Options B and D contain only narrow transformations (withColumn, select, where) and never shuffle.


NEW QUESTION # 33
In the code block below, aggDF contains aggregations on a streaming DataFrame:

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

  • A. replace
  • B. aggregate
  • C. complete
  • D. append

Answer: C

Explanation:
The correct output mode for streaming aggregations that need to output the full updated results at each trigger is "complete".
From the official documentation:
"complete: The entire updated result table will be output to the sink every time there is a trigger."
This is ideal for aggregations, such as counts or averages grouped by a key, where the result table changes incrementally over time.
append: only outputs newly added rows
replace and aggregate: invalid values for output mode


NEW QUESTION # 34
A data engineer is streaming data from Kafka and requires:
Minimal latency
Exactly-once processing guarantees
Which trigger mode should be used?

  • A. .trigger(continuous=True)
  • B. .trigger(continuous='1 second')
  • C. .trigger(processingTime='1 second')
  • D. .trigger(availableNow=True)

Answer: C

Explanation:
Exactly-once guarantees in Spark Structured Streaming require micro-batch mode (default), not continuous mode.
Continuous mode (.trigger(continuous=...)) only supports at-least-once semantics and lacks full fault-tolerance.
trigger(availableNow=True) is a batch-style trigger, not suited for low-latency streaming.
So:
Option C uses micro-batching with a tight trigger interval → minimal latency + exactly-once guarantee.
Final Answer: C


NEW QUESTION # 35
What is the behavior for function date_sub(start, days) if a negative value is passed into the days parameter?

  • A. An error message of an invalid parameter will be returned
  • B. The number of days specified will be added to the start date
  • C. The number of days specified will be removed from the start date
  • D. The same start date will be returned

Answer: B

Explanation:
The function date_sub(start, days) subtracts the number of days from the start date. If a negative number is passed, the behavior becomes a date addition.
Example:
SELECT date_sub('2024-05-01', -5)
-- Returns: 2024-05-06
So, a negative value effectively adds the absolute number of days to the date.
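The same arithmetic can be checked with Python's standard library. The function below only mimics the described behavior with datetime; it is a stand-in written for this note, not Spark's implementation.

```python
from datetime import date, timedelta


def date_sub(start: date, days: int) -> date:
    """Subtract `days` from `start`; a negative `days` moves the date forward."""
    return start - timedelta(days=days)


print(date_sub(date(2024, 5, 1), 5))   # 2024-04-26
print(date_sub(date(2024, 5, 1), -5))  # 2024-05-06
```

Subtracting a negative number is an addition, so no special case or error path is needed.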


NEW QUESTION # 36
A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.
Which code fragment meets the requirements?

  • A. regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
    )
  • B. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
    )
  • C. regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
    )
  • D. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
    )

Answer: B

Explanation:
The question requires creating a dictionary where keys are region values and values are the corresponding region_id integers. Furthermore, it asks to retrieve only the smallest 3 region_id values.
Key observations:
.select('region', 'region_id') puts the column order as expected by dict() - where the first column becomes the key and the second the value.
.sort('region_id') ensures sorting in ascending order so the smallest IDs are first.
.take(3) retrieves exactly 3 rows.
Wrapping the result in dict(...) correctly builds the required Python dictionary: { 'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2 }.
Incorrect options:
Option A flips the order to region_id first, resulting in a dictionary with integer keys - not what's asked.
Option C uses .limit(3) without sorting, which leads to non-deterministic rows based on partition layout.
Option D sorts in descending order, giving the largest rather than smallest region_ids.
Hence, Option B meets all the requirements precisely.
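Why dict() works on the result of take(3) is easier to see with a small stand-in. A pyspark.sql.Row is a tuple subclass, so a list of two-field rows behaves like a list of (key, value) pairs. The sketch below uses a namedtuple in place of Row and hypothetical table contents, since the original table image is not available.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is also a tuple subclass.
Row = namedtuple("Row", ["region", "region_id"])

# Hypothetical table contents, deliberately unsorted.
rows = [
    Row("EUROPE", 3),
    Row("AFRICA", 0),
    Row("ASIA", 2),
    Row("MIDDLE EAST", 4),
    Row("AMERICA", 1),
]

# Equivalent of sort('region_id') followed by take(3).
smallest = sorted(rows, key=lambda r: r.region_id)[:3]

# dict() consumes each 2-tuple as (key, value).
regions = dict(smallest)
print(regions)  # {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}
```

Selecting the columns in the opposite order would make region_id the key, and skipping the sort would make the three rows arbitrary.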


NEW QUESTION # 37
A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

  • A. Use the applyInPandas API:
    df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()
  • B. Use a regular Spark UDF:
    from pyspark.sql.functions import mean
    df.groupBy("user_id").agg(mean("value")).show()
  • C. Use the mapInPandas API:
    df.mapInPandas(mean_func, schema="user_id long, value double").show()
  • D. Use a Pandas UDF:
    @pandas_udf("double")
    def mean_func(value: pd.Series) -> float:
        return value.mean()
    df.groupby("user_id").agg(mean_func(df["value"])).show()

Answer: A

Explanation:
The correct approach to perform a parallelized groupBy operation across Spark worker nodes using Pandas API is via applyInPandas. This function enables grouped map operations using Pandas logic in a distributed Spark environment. It applies a user-defined function to each group of data represented as a Pandas DataFrame.
As per the Databricks documentation:
"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined function to each group of a DataFrame and outputs a new DataFrame. This is the recommended approach for using Pandas logic across grouped data with parallel execution."
Option A is correct and achieves this parallel execution.
Option B uses the built-in mean aggregation (despite its label, it is not a UDF), which is efficient but not customizable with Pandas logic.
Option C (mapInPandas) applies to the entire DataFrame, not grouped operations.
Option D defines a Series-to-scalar Pandas UDF that returns one value per group; it does not hand each group to the function as a Pandas DataFrame.
Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option A using applyInPandas is the intended answer.


NEW QUESTION # 38
A data engineer wants to create a Streaming DataFrame that reads from a Kafka topic called feed.

Which code fragment should be inserted in line 5 to meet the requirement?
Code context:
spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","host1:port1,host2:port2") \
.[LINE5] \
.load()
Options:

  • A. .option("subscribe", "feed")
  • B. .option("kafka.topic", "feed")
  • C. .option("subscribe.topic", "feed")
  • D. .option("topic", "feed")

Answer: A

Explanation:
To read from a specific Kafka topic using Structured Streaming, the correct syntax is:
.option("subscribe", "feed")
The Spark Structured Streaming + Kafka Integration Guide defines subscribe as a comma-separated list of topics to subscribe to. Only one of the assign, subscribe, or subscribePattern options can be specified for a Kafka source.
B. "kafka.topic" is not a recognized option.
C. "subscribe.topic" is invalid.
D. "topic" is not valid for the Kafka source in Spark; it is an option of the Kafka sink.


NEW QUESTION # 39
A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.
Which code snippet meets the requirement of the developer?

  • A. df.orderBy(col("age").asc(), col("salary").asc()).show()
  • B. df.sort("age", "salary", ascending=[True, True]).show()
  • C. df.sort("age", "salary", ascending=[False, True]).show()
  • D. df.orderBy("age", "salary", ascending=[True, False]).show()

Answer: D

Explanation:
To sort a PySpark DataFrame by multiple columns with mixed sort directions, the correct usage is:
df.orderBy("age", "salary", ascending=[True, False])
age will be sorted in ascending order
salary will be sorted in descending order
The orderBy() and sort() methods in PySpark accept a list of booleans to specify the sort direction for each column.
Documentation Reference: PySpark API - DataFrame.orderBy
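The effect of ascending=[True, False] can be previewed on a handful of rows with plain Python. This is a model of the ordering only, with hypothetical sample data; it does not run Spark.

```python
from collections import namedtuple

Person = namedtuple("Person", ["name", "age", "salary"])

people = [  # hypothetical sample rows
    Person("Ann", 30, 5000),
    Person("Bob", 25, 4000),
    Person("Cid", 30, 7000),
    Person("Dee", 25, 6000),
]

# Ascending age first, then descending salary within each age.
ordered = sorted(people, key=lambda p: (p.age, -p.salary))
print([p.name for p in ordered])  # ['Dee', 'Bob', 'Cid', 'Ann']
```

The second sort key only breaks ties inside each age group, which is how the per-column direction list is applied as well.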


NEW QUESTION # 40
What is the difference between df.cache() and df.persist() in Spark DataFrame?

  • A. persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and cache() - Can be used to set different storage levels.
  • B. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_DESER).
  • C. cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and persist() - Can be used to set different storage levels to persist the contents of the DataFrame.
  • D. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

Answer: C

Explanation:
Both cache() and persist() are Spark DataFrame storage operations that store computed results in memory (and optionally on disk) to speed up subsequent actions on the same DataFrame.
Key difference:
cache() is shorthand for persist() with the default storage level (MEMORY_AND_DISK_DESER for DataFrames in Spark 3.x).
persist() allows specifying different storage levels, such as MEMORY_ONLY, DISK_ONLY, or MEMORY_AND_DISK.
Example:
df.cache() # Uses the default storage level
df.persist(StorageLevel.MEMORY_ONLY) # Custom storage level
Both are lazy: the data is actually cached when the first action runs (e.g., count(), collect()).
Why the other options are incorrect:
A: Reverses the roles; cache() takes no storage-level argument.
B: Only persist() accepts a storage level; cache() always uses the default.
D: persist() default is not DISK_ONLY; it is the same default level that cache() uses.
Reference:
PySpark API Reference - DataFrame.cache() and DataFrame.persist().
Databricks Exam Guide (June 2025): Section "Developing Apache Spark DataFrame/DataSet API Applications" - caching, persistence, and storage levels.


NEW QUESTION # 41
A data engineer is streaming data from Kafka and requires:
Minimal latency
Exactly-once processing guarantees
Which trigger mode should be used?

  • A. .trigger(continuous=True)
  • B. .trigger(continuous='1 second')
  • C. .trigger(processingTime='1 second')
  • D. .trigger(availableNow=True)

Answer: C

Explanation:
Exactly-once guarantees in Spark Structured Streaming require micro-batch mode (default), not continuous mode.
Continuous mode (.trigger(continuous=...)) only supports at-least-once semantics and lacks full fault-tolerance.
trigger(availableNow=True) is a batch-style trigger, not suited for low-latency streaming.
So:
Option C uses micro-batching with a tight trigger interval → minimal latency + exactly-once guarantee.
Final answer: C


NEW QUESTION # 42
A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.
The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:
Reads directly from /data/input.json.
Infers the schema automatically.
Merges differing schemas.
Which code snippet should the engineer use?

  • A. CREATE EXTERNAL TABLE users
    USING json
    OPTIONS (path '/data/input.json', mergeSchema 'true');
  • B. CREATE EXTERNAL TABLE users
    USING json
    OPTIONS (path '/data/input.json', mergeAll 'true');
  • C. CREATE TABLE users
    USING json
    OPTIONS (path '/data/input.json');
  • D. CREATE EXTERNAL TABLE users
    USING json
    OPTIONS (path '/data/input.json', inferSchema 'true');

Answer: A

Explanation:
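To handle JSON files with evolving or differing schemas, the answer key relies on the option mergeSchema 'true', which is meant to merge all fields across files into a unified schema. In open-source Spark this option is documented for Parquet and ORC sources; JSON schema inference already unions the fields it finds across records.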
Correct syntax:
CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeSchema 'true');
This creates an external table directly on the JSON data, inferring schema automatically and merging variations.
Why the other options are incorrect:
B: mergeAll is not a valid Spark SQL option.
C: Missing schema merge configuration, and the table is not declared EXTERNAL.
D: inferSchema is a CSV option; JSON schemas are inferred by default, and nothing here requests merging.
Reference:
Spark SQL Data Sources - JSON file options (mergeSchema, path).
Databricks Exam Guide (June 2025): Section "Using Spark SQL" - creating external tables and schema inference for JSON data.


NEW QUESTION # 43
A data engineer is working on the DataFrame:

(Referring to the table image: it has columns Id, Name, count, and timestamp.)
Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?

  • A. df.select("Name").distinct().orderBy(df["Name"].desc())
  • B. df.select("Name").distinct()
  • C. df.select("Name").distinct().orderBy(df["Name"])
  • D. df.select("Name").orderBy(df["Name"].asc())

Answer: C

Explanation:
To extract unique values from a column and sort them alphabetically:
distinct() is required to remove duplicate values.
orderBy() is needed to sort the results alphabetically (ascending by default).
Correct code:
df.select("Name").distinct().orderBy(df["Name"])
This is standard DataFrame API usage in PySpark.
Option A sorts in descending order, which doesn't meet the requirement for alphabetical (ascending) order.
Option B omits sorting.
Option D is incorrect because it does not remove duplicates.
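The two required steps, deduplicate and then sort ascending, map directly onto a short plain-Python model. The names list is hypothetical sample data standing in for the Name column.

```python
names = ["Carol", "Alice", "Bob", "Carol", "Alice"]  # hypothetical Name column

# set() removes duplicates (like distinct()), sorted() orders ascending (like orderBy()).
unique_sorted = sorted(set(names))
print(unique_sorted)  # ['Alice', 'Bob', 'Carol']
```

Dropping either step breaks one of the two requirements: without set() the duplicates remain, and without sorted() the order is arbitrary.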


NEW QUESTION # 44
A data engineer wants to write a Spark job that creates a new managed table. If the table already exists, the job should fail and not modify anything.
Which save mode and method should be used?

  • A. save with mode ErrorIfExists
  • B. saveAsTable with mode Overwrite
  • C. saveAsTable with mode ErrorIfExists
  • D. save with mode Ignore

Answer: C

Explanation:
The method saveAsTable() creates a managed table registered in the metastore; with mode ErrorIfExists it fails if the table already exists and leaves it untouched.
From Spark documentation:
"The mode 'ErrorIfExists' (default) will throw an error if the table already exists."
Thus:
Option C is correct.
Option B (Overwrite) would overwrite existing data - not acceptable here.
Options A and D use save(), which writes files to a path and does not create a managed table with metadata in the metastore.
Final answer: C


NEW QUESTION # 45
A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:

import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType
def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)
shake_256_udf = sf.udf(shake_256, StringType())
The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:
shake_256_udf = sf.pandas_udf(shake_256, StringType())
However, the developer receives the error:
What should the signature of the shake_256() function be changed to in order to fix this error?

  • A. def shake_256(df: pd.Series) -> str:
  • B. def shake_256(df: pd.Series) -> pd.Series:
  • C. def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:
  • D. def shake_256(raw: str) -> str:

Answer: B

Explanation:
When converting a standard PySpark UDF to a Pandas UDF for performance optimization, the function must operate on a Pandas Series as input and return a Pandas Series as output.
In this case, the original function signature:
def shake_256(raw: str) -> str
is scalar - not compatible with Pandas UDFs.
According to the official Spark documentation:
"Pandas UDFs operate on pandas.Series and return pandas.Series. The function definition should be:
def my_udf(s: pd.Series) -> pd.Series:
and it must be registered using pandas_udf(...)."
Therefore, to fix the error:
The function should be updated to:
def shake_256(df: pd.Series) -> pd.Series:
    return df.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))
This will allow Spark to execute the Pandas UDF on whole batches of values, which reduces serialization overhead compared to standard UDFs.
Reference: Apache Spark 3.5 Documentation > User-Defined Functions > Pandas UDFs
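The fix is about the shape of the function: one value in and one value out, versus a whole batch in and a whole batch out. The sketch below shows that shape with the standard library only, using a plain list as a stand-in for pd.Series. The names shake_256_scalar and shake_256_batch are illustrative and do not appear in the question.

```python
import hashlib
from typing import List


def shake_256_scalar(raw: str) -> str:
    """Hash one string: 20 bytes of SHAKE-256 output as 40 hex characters."""
    return hashlib.shake_256(raw.encode()).hexdigest(20)


def shake_256_batch(batch: List[str]) -> List[str]:
    """Batch-in, batch-out shape; a list stands in for pd.Series here."""
    return [shake_256_scalar(x) for x in batch]


digests = shake_256_batch(["alice", "bob"])
print(len(digests), len(digests[0]))  # 2 40
```

The batch function returns exactly one output per input, in the same order, which is the contract a Series-to-Series Pandas UDF has to satisfy.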


NEW QUESTION # 46
A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.
How can this be achieved?

  • A. By configuring the option recoveryLocation during writeStream
  • B. By configuring the option recoveryLocation during the SparkSession initialization
  • C. By configuring the option checkpointLocation during readStream
  • D. By configuring the option checkpointLocation during writeStream

Answer: D

Explanation:
To enable a Structured Streaming query to recover from failures or intentional shutdowns, it is essential to specify the checkpointLocation option during the writeStream operation. This checkpoint location stores the progress information of the streaming query, allowing it to resume from where it left off.
According to the Databricks documentation:
"You must specify the checkpointLocation option before you run a streaming query, as in the following example:
.option("checkpointLocation", "/path/to/checkpoint/dir")
.toTable("catalog.schema.table")"
- Databricks Documentation: Structured Streaming checkpoints
By setting the checkpointLocation during writeStream, Spark can maintain state information and ensure exactly-once processing semantics, which are crucial for reliable streaming applications.


NEW QUESTION # 47
......

Get Ready with Associate-Developer-Apache-Spark-3.5 Exam Dumps (2026): https://www.premiumvcedump.com/Databricks/valid-Associate-Developer-Apache-Spark-3.5-premium-vce-exam-dumps.html

Realistic Associate-Developer-Apache-Spark-3.5 Dumps are Available for Instant Access: https://drive.google.com/open?id=110YMtksPyGAZvflTAVgsnf59hyihWHfn