10 Scenario-Based Azure Data Engineer Interview Questions That Separate Experts From Beginners

Azure Data Engineer interview questions with answers | DP-700 Certification Prep.

📘 Quick check: Think you know Azure Data Engineering? Move beyond basic SQL and Synapse definitions. Here are 10 real-world crisis scenarios, from pipeline failures to GDPR purges, with technically precise answers that separate hands-on experience from textbook knowledge.

Why Scenario-Based Questions Matter

In a real Azure Data Engineering role, no one asks you to “define a Data Lake.” They ask: “The CEO wants a real-time dashboard, but the source system is an on-premises SQL Server from 2008. What do you do?”

This post covers 10 common but brutal scenarios I’ve encountered across fintech, healthcare, and retail migrations. The focus is architecture, trade-offs, and Azure-native patterns rather than syntax.

1. The Late-Arriving Fact in Stream Analytics

Scenario:
You built a real-time fraud detection pipeline using Azure Event Hubs → Stream Analytics → Synapse. Suddenly, a network glitch delays a batch of transactions by 2 hours. When they arrive, the dashboard shows “fraud after the fact” and confuses operations.

Question:
How do you handle late-arriving data without reprocessing the entire stream?

Answer:
Implement watermarking and event-ordering policies in Stream Analytics. Set the out-of-order policy to Adjust with a late-arrival tolerance window (e.g., 5 minutes). For delays beyond that window, route late events to a separate Event Hub or a Blob Storage dead-letter path. Then run an idempotent batch reconciliation job (Azure Data Factory or Spark) that updates the Synapse tables via MERGE logic (Type 2 SCD where history matters). Never mutate the original stream sink; instead, maintain a last_updated_ts column and expose corrected facts in a “late data” view. The dashboard should distinguish corrected records from real-time ones.
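The idempotent reconciliation step can be sketched in plain Python. This is a minimal illustration with hypothetical field names (transaction_id, event_ts); in production this logic would be a Spark or ADF MERGE, not an in-memory dict:

```python
from datetime import datetime, timezone

def reconcile_late_events(fact_store: dict, late_events: list) -> dict:
    """Idempotently merge late-arriving events into a fact store keyed by
    transaction_id. An event is applied only if it is strictly newer than
    what is already stored, so re-running the job has no duplicate effect."""
    for event in late_events:
        key = event["transaction_id"]
        current = fact_store.get(key)
        if current is None or event["event_ts"] > current["event_ts"]:
            fact_store[key] = {
                **event,
                "last_updated_ts": datetime.now(timezone.utc).isoformat(),
                # flag so the dashboard can mark corrected vs. real-time facts
                "is_late_correction": current is not None,
            }
    return fact_store
```

Because the merge is keyed and timestamp-guarded, replaying the same dead-letter batch twice is harmless.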

2. Partition Explosion in Azure Data Lake Storage (ADLS) Gen2

Scenario:
Your team logs IoT sensor data partitioned by device_id/year/month/day/hour/minute. After 6 months, ADLS reports “throttling” and slow LIST operations. Queries in Synapse Serverless take minutes.

Question:
What went wrong, and how do you fix it without rewriting history?

Answer:
You hit small-file and partition explosion: too many subfolders and tiny JSON/Parquet files. Blob LIST operations are paginated (at most 5,000 results per call), so enumerating millions of tiny objects dominates query time and triggers throttling. Fix:
1. Re-batch historical data using a Spark notebook (Synapse) into coarser partitions: device_id/year_month_day (drop minute/hour).
2. Use Hive-style partitioning (device_id=123/year=2025/month=03) for partition pruning.
3. Compact small files into outputs of at least ~100 MB each (Parquet with Snappy compression).
4. For future streams, use a tumbling window (5 min) in Stream Analytics or Spark Structured Streaming before writing to ADLS.
5. Schedule a recurring compaction job (e.g., a daily Spark run) to merge the previous day’s small files; storage lifecycle rules can then tier or delete the originals. Never use minute-level folders for high-cardinality device IDs.
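The compaction target in step 3 can be planned with a simple greedy grouping. A stdlib-only sketch (real pipelines would let Spark handle this via coalesce/repartition; sizes here are illustrative):

```python
def plan_compaction(file_sizes: list, target_bytes: int = 100 * 1024**2) -> list:
    """Greedily group small files into batches of roughly target_bytes,
    so each merged output Parquet file lands near the ~100 MB sweet spot."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):  # largest first packs tighter
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Each returned group is one merge job: read its members, write a single Parquet file.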

3. The Accidental GDPR Purge Request

Scenario:
A customer submits a “right to be forgotten” request. Their data exists in: Cosmos DB (transactional), ADLS (raw Parquet), Synapse Dedicated SQL Pool (aggregates), and Power BI cache. You have no native cross-service cascading delete.

Question:
Design a compliant, auditable deletion strategy.

Answer:
Azure does not support true cascading delete. Implement soft-delete + logical masking instead.
Cosmos DB: Add a isDeleted boolean + TTL on the container. Use Change Feed to detect deletes.
ADLS: Move the customer’s files to a quarantine/ folder via ADF, then apply a retention policy (30 days) before hard delete.
Synapse Dedicated Pool: Replace PII values with 'REDACTED' or NULL, and set isActive=0. Never DELETE rows—it breaks historical aggregates.
Power BI: Refresh dataset with a filter isActive=1.
Audit: Log all operations to Azure Log Analytics with a correlation ID. Produce a “deletion certificate” using Azure Purview’s lineage. Key principle: “Delete the pointer, not always the byte.”
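The Synapse redaction rule can be sketched as a pure function. Column names in PII_FIELDS are illustrative; in the Dedicated Pool this would be an UPDATE statement, with the correlation ID written to Log Analytics:

```python
import hashlib

PII_FIELDS = {"name", "email", "ssn", "address"}  # illustrative column set

def redact_record(record: dict, correlation_id: str) -> dict:
    """Soft-delete a customer row: PII columns are overwritten, the row is
    kept (isActive=0) so historical aggregates still reconcile, and a hash
    of the correlation_id ties the row to the audit trail."""
    redacted = {
        k: ("REDACTED" if k in PII_FIELDS else v) for k, v in record.items()
    }
    redacted["isActive"] = 0
    # one-way hash: proves which request caused the redaction without storing it
    redacted["deletion_ref"] = hashlib.sha256(correlation_id.encode()).hexdigest()[:16]
    return redacted
```

Non-PII measures (balances, counts) survive, which is exactly why aggregates don’t break.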

4. Exfiltration of Sensitive Data via a Synapse Link

Scenario:
A junior engineer enables Synapse Link for Dataverse to replicate CRM data into ADLS. Two days later, you notice a service principal (from a decommissioned dev environment) reading 2 million rows of customer SSNs from the same container.

Question:
How do you retrospectively detect this, lock it down, and prevent recurrence?

Answer:
Detection: Query the storage account’s diagnostic logs (with the StorageRead category enabled) in Log Analytics or Azure Data Explorer, filtering on AuthenticationType == ServicePrincipal, the container, the time range, and the UserAgent.
Immediate lock: Disable the service principal in Azure AD → revoke its RBAC Storage Blob Data Reader role → enable firewall + private endpoint for the storage account.
Prevention:
– Enforce Azure Policy to block public network access.
– Use Microsoft Purview to classify SSN/PCI columns and auto-apply sensitivity labels.
– Tighten Synapse and storage RBAC so untrusted principals hold no data-plane read role (e.g., Storage Blob Data Reader); use Azure deny assignments where available.
– Enable customer-managed key (CMK) with key rotation.
Remediation: Rotate all keys, force regenerate SAS tokens, and notify compliance within 72 hours (GDPR breach window).
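The detection filter can be expressed as a small sketch over exported log entries. The field names below are simplified stand-ins for the storage diagnostic log schema, not the exact column names:

```python
def find_suspicious_reads(logs: list, allowed_principals: set, container: str) -> list:
    """Flag read operations against a container by any service principal
    that is not on the allow-list (entries modeled loosely on Azure
    Storage diagnostic log records)."""
    return [
        e for e in logs
        if e["OperationName"] == "GetBlob"            # read, not write/delete
        and e["AuthenticationType"] == "ServicePrincipal"
        and container in e["Uri"]
        and e["PrincipalId"] not in allowed_principals
    ]
```

In practice you would run the equivalent KQL in Log Analytics, then feed the flagged principal IDs into the lockdown steps above.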

5. ADF Pipeline Slamming a Source OLTP Database

Scenario:
Your Azure Data Factory (ADF) copy activity runs every 15 minutes, using SELECT * FROM Orders with no filter. The on-premises SQL Server (Standard edition) starts timing out for the POS system during peak hours.

Question:
How do you reduce load without sacrificing near real-time?

Answer:
Switch from full table scans to incremental loads with watermarking.
– Add LastModified column in source (if missing, use CDC or Change Tracking in SQL Server).
– In ADF, store the last high-watermark in Azure SQL DB or Blob (watermark.txt).
– Use query pushdown: WHERE LastModified > '@{watermark}' AND LastModified <= '@{current_utc}'.
– Add parallel copy throttling (DIU = 2, not auto).
– Use staging via Blob to avoid long transactions.
– For legacy DBs without a timestamp column, implement row versioning (a rowversion column) or populate a staging table with AFTER triggers (SQL Server has no BEFORE triggers; use AFTER or INSTEAD OF).
– Final step: Monitor ADF copy duration and source CPU utilization (or DTU, if the source is Azure SQL) via Log Analytics; alert when utilization exceeds 70%.
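The watermark mechanics can be sketched as follows. This is a hedged illustration of the pattern only; in ADF it is a Lookup activity plus a parameterized source query, not Python:

```python
from datetime import datetime, timezone

def build_incremental_query(table: str, last_watermark: str):
    """Build the pushdown query for one incremental window and return the
    new high watermark to persist after a successful copy. Freezing the
    upper bound up front avoids missing rows that arrive mid-copy."""
    current_utc = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    query = (
        f"SELECT * FROM {table} "
        f"WHERE LastModified > '{last_watermark}' "
        f"AND LastModified <= '{current_utc}'"
    )
    return query, current_utc
```

Only after the copy succeeds do you overwrite the stored watermark, so a failed run simply re-reads the same window.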

6. Slowly Changing Dimension (SCD) Type 2 in Synapse Spark

Scenario:
You need to track history of customer addresses in a gold layer table. The source system sends full daily extracts (no CDC). A naive overwrite would lose history.

Question:
Explain your upsert logic using Spark (Synapse) without a MERGE statement.

Answer:
Use Delta Lake MERGE (which Spark supports via the Delta Lake APIs) or implement a manual four-step approach:
1. Read existing gold table and new source extract.
2. Identify changes (address change) by comparing hash of all business attributes except surrogate keys.
3. Expire old rows: Set is_current = 0 and valid_to = current_date for changed records.
4. Insert new rows with valid_from = current_date, valid_to = '9999-12-31', is_current = 1.
For performance, bucket by customer_id (e.g., 16 buckets) and repartition before large writes.
Store the table in Delta format (Parquet under the hood) to enable time travel (VERSION AS OF). Never overwrite the entire table. Run spark.sql("OPTIMIZE table ZORDER BY (customer_id)") weekly to compact files and co-locate each customer’s rows.
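The hash-compare upsert in steps 2–4 can be sketched in plain Python with illustrative column names; in Synapse this same logic runs over Spark DataFrames:

```python
import hashlib

def attr_hash(row: dict, key_cols=("customer_id",)) -> str:
    """Hash all business attributes except keys and SCD audit columns,
    so any change in a tracked attribute changes the hash."""
    skip = set(key_cols) | {"valid_from", "valid_to", "is_current"}
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k not in skip)
    return hashlib.sha256(payload.encode()).hexdigest()

def scd2_merge(current_rows: list, incoming: list, today: str) -> list:
    """SCD Type 2: expire the current version of changed records and
    append a new open-ended version. Unchanged records are untouched."""
    latest = {r["customer_id"]: r for r in current_rows if r["is_current"] == 1}
    result = list(current_rows)
    for row in incoming:
        old = latest.get(row["customer_id"])
        if old is not None and attr_hash(old) == attr_hash(row):
            continue  # no attribute change: skip
        if old is not None:
            old["is_current"] = 0      # expire the previous version
            old["valid_to"] = today
        result.append({**row, "valid_from": today,
                       "valid_to": "9999-12-31", "is_current": 1})
    return result
```

Re-running the merge with the same daily extract is a no-op, which is what makes full-extract sources safe without CDC.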

7. Cross-Region Disaster Recovery for Event Hubs

Scenario:
You process 50k events/second from IoT devices. Azure Event Hubs is deployed in East US. The region goes down for 4 hours. Your SLA requires < 15 minutes of data loss.

Question:
Design a DR strategy without rebuilding the consumer group.

Answer:
Use Event Hubs Geo-DR (alias) with the paired region (e.g., West US).
– Enable the Geo-DR alias: it replicates namespace metadata (entities, consumer groups) but not the events themselves. To close the gap to <15 minutes of loss:
– Mirror the stream: use Azure Functions or Stream Analytics to forward events asynchronously to a secondary Event Hub in West US.
– Keep checkpoints in geo-redundant storage (RA-GZRS) so consumers can resume from the last recorded position.
– Run active-passive consumers: the primary reads from East US while the secondary stays idle. Because clients connect through the alias, failover repoints them automatically; the secondary then resumes against the mirrored stream from its checkpoints.
Data loss: the acceptable window is 15 minutes; make writes to ADLS idempotent so late duplicates are deduplicated by event_id.
Test failover quarterly with a planned failover drill.
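The idempotent-sink point reduces to a few lines. A sketch only: in practice the “seen” set would be the event_id keys already present in the Delta table, not an in-memory set:

```python
def dedup_events(events: list, seen: set) -> list:
    """Idempotent sink: drop events whose event_id was already written,
    so events replayed from the mirror after a regional failover are
    harmless duplicates rather than double-counted facts."""
    fresh = []
    for e in events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            fresh.append(e)
    return fresh
```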

8. Synapse Serverless SQL Timing Out on Large File Sets

Scenario:
You query a 5 TB folder with 50,000 small JSON files using OPENROWSET(BULK...). The query times out after 30 minutes, even when the path is narrowed with a filepath() filter.

Question:
How do you restructure the data or the query to succeed?

Answer:
Serverless SQL is not designed for tiny files. Fixes:
1. File compaction – Run a Spark job to merge into 200 MB–1 GB Parquet files.
2. Partition elimination – Use WHERE filepath() LIKE '/year=2025/month=03%' to prune drastically.
3. Materialize once with CETAS – use CREATE EXTERNAL TABLE AS SELECT to convert the JSON into partitioned Parquet a single time, then point queries at the result.
4. Explicit schema – supply a WITH clause in OPENROWSET instead of relying on schema inference, which must sample files.
5. External tables – define an external table over the compacted, partitioned folder so schema and location are resolved once, not per query.
6. Fallback – load critical datasets into a Dedicated SQL Pool, where data is redistributed internally and the source file layout no longer matters.
Best practice: Never use Serverless for ETL; use it for ad-hoc exploration of well-partitioned Parquet.
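Partition elimination (fix 2) is easy to picture as path filtering. A toy sketch of what filepath() pruning achieves, not the engine’s implementation:

```python
def prune_by_partition(file_paths: list, year: int, month: int) -> list:
    """Keep only files under the requested Hive-style partition prefix,
    so nothing outside it is ever listed or scanned."""
    prefix = f"year={year}/month={month:02d}/"
    return [p for p in file_paths if prefix in p]
```

With year/month in the folder names, a two-predicate filter can skip the vast majority of a 50,000-file folder before a single byte is read.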

9. Pipeline Dependency Hell with ADF and Logic Apps

Scenario:
You have 40 ADF pipelines: Sales → Finance → Inventory → Reporting. A Logic App triggers on Blob creation, which starts another pipeline. One failure at 2 AM causes 8 dependent pipelines to hang. No one knows the root pipeline.

Question:
Implement observability and self-healing retries.

Answer:
Observability:
– Add a run_id to every pipeline via pipeline().runId. Pass it as a parameter to downstream pipelines.
– Write each pipeline’s start/end/status to Azure Monitor with custom dimensions (source_system, depends_on).
– Use the Gantt view in ADF’s Monitoring hub to visualize run chains and spot the root failure.
Self-healing:
– Set each activity retry = 3, retry interval = 30 sec, exponential backoff.
– For transient failures, use web activity to call a retry endpoint.
– Implement a control table in Azure SQL DB: pipeline_dependencies (parent_run_id, child_run_id, status). A separate ADF “orchestrator” checks every 5 minutes for hung child runs and re-triggers them.
– Use failure webhook to send alert to Teams + create Azure DevOps bug.
– Avoid Logic Apps for orchestration; prefer ADF’s Execute Pipeline activity with waitOnCompletion = true.
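The retry policy above can be sketched as follows (assumed delays of 30 s doubling per attempt; ADF configures this declaratively on the activity, so no code is needed there):

```python
import time

def run_with_retry(activity, max_retries=3, base_delay=30, sleep=time.sleep):
    """Retry a callable with exponential backoff (30s, 60s, 120s).
    'sleep' is injectable so orchestration tests don't actually wait."""
    for attempt in range(max_retries + 1):
        try:
            return activity()
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries: surface the failure to the alerting path
            sleep(base_delay * (2 ** attempt))
```

Transient failures (throttling, brief network blips) usually clear within the first or second backoff; persistent ones fail fast enough to trigger the Teams alert.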

10. Cost Explosion from Unbounded Synapse Spark Sessions

Scenario:
Your data scientists spin up 30 Synapse Spark pools (each with 3 nodes) for ad-hoc notebooks. They leave sessions running over the weekend. Your monthly Azure bill doubles.

Question:
How do you enforce governance without blocking innovation?

Answer:
Auto-pause & auto-scale: Set auto-pause to 10 minutes idle, min nodes = 3, max nodes = 10.
RBAC + policies:
– Create a custom role Spark Job Submitter that cannot create pools (only use existing).
– Enforce Azure Policy: “Deny creation of Spark pools with node count > 10” and “Require tag owner and cost-center”.
Budget alerts: Use Azure Cost Management with a monthly budget of $500 for Synapse, trigger at 80% to send email.
Automated shutdown: Azure Function that runs every hour, queries spark_session via Synapse REST API, kills sessions older than 4 hours (except those tagged allow_long_running=true).
Education: Provide a shared pool named analytics_small (3 nodes) for ad-hoc work. Force notebook to attach only to approved pools via %configure header check.
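The automated-shutdown selection logic might look like this. Field names such as started_at and the tags dict are assumptions for illustration; a real job would read session metadata from the Synapse Livy REST API:

```python
from datetime import datetime, timedelta, timezone

def sessions_to_kill(sessions: list, max_age_hours: int = 4, now=None) -> list:
    """Select Spark session IDs older than max_age_hours that are not
    tagged allow_long_running=true, for the hourly cleanup Function."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        s["id"] for s in sessions
        if s["started_at"] < cutoff
        and s.get("tags", {}).get("allow_long_running") != "true"
    ]
```

The opt-out tag keeps genuinely long-running jobs safe while everything forgotten over the weekend gets reclaimed.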

Final Authority Check

If you can explain why you chose Event Hubs Geo-DR over active replication, or when to use Serverless vs. Dedicated pool—you’re not just an Azure Data Engineer. You’re the person who gets called at 2 AM and fixes it without breaking the bank.

Want more? In the next post, I’ll break down cost-optimization patterns for Azure Purview and Real-time CDC from Oracle to Fabric. Subscribe below.


Author: Devraj Sarkar, Principal AI and Data Security Architect @ [AEM].
