Azure Data Engineer Interview Questions with Answers | DP-700 Certification Prep
10 Scenario-Based Azure Data Engineer Interview Questions That Separate Experts From Beginners
Why Scenario-Based Questions Matter
In a real Azure Data Engineering role, no one asks you to "define a Data Lake." They ask: "The CEO wants a real-time dashboard, but the source system is an on-premises SQL Server from 2008. What do you do?"
This post covers 10 common but brutal scenarios I've encountered across fintech, healthcare, and retail migrations. The focus is architecture, trade-offs, and Azure-native patterns.
1. The Late-Arriving Fact in Stream Analytics
Scenario:
You built a real-time fraud detection pipeline using Azure Event Hubs → Stream Analytics → Synapse. Suddenly, a network glitch delays a batch of transactions by 2 hours. When they arrive, the dashboard shows "fraud after the fact" and confuses operations.
Question:
How do you handle late-arriving data without reprocessing the entire stream?
Implement watermarking and out-of-order policies in Stream Analytics. Set the outOfOrderPolicy to Adjust with a late-arrival tolerance window (e.g., 5 minutes). For delays beyond that window, route late events to a separate Event Hub or a Blob Storage dead-letter path. Then run an idempotent batch reconciliation job (using Azure Data Factory or Spark) that updates the Synapse dimension tables via MERGE logic (Type 2 SCD). Never mutate the original stream sink; instead, maintain a last_updated_ts column and expose corrected facts in a "late data" view. The dashboard should flag corrected vs. real-time records.
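The reconciliation job's idempotent merge can be sketched in plain Python (field names such as transaction_id, event_time, and is_late_correction are illustrative, not part of any Azure API):

```python
from datetime import datetime, timezone

def reconcile_late_events(facts, late_events):
    """Idempotently merge late-arriving events into the fact store.

    `facts` maps transaction_id -> record dict; `late_events` is a list of
    event dicts. A late event only wins if its event_time is strictly newer
    than the stored record's, so replaying the same batch is a no-op.
    Corrected records are flagged so the dashboard can distinguish them.
    """
    for ev in late_events:
        key = ev["transaction_id"]
        current = facts.get(key)
        if current is None or ev["event_time"] > current["event_time"]:
            facts[key] = {
                **ev,
                "is_late_correction": current is not None,
                "last_updated_ts": datetime.now(timezone.utc).isoformat(),
            }
    return facts
```

Because the update condition is a strict comparison, running the reconciliation twice over the same dead-letter batch changes nothing, which is what makes the batch job safe to retry.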
2. Partition Explosion in Azure Data Lake Storage (ADLS) Gen2
Scenario:
Your team logs IoT sensor data partitioned by device_id/year/month/day/hour/minute. After 6 months, ADLS reports throttling and slow LIST operations. Queries in Synapse Serverless take minutes.
Question:
What went wrong, and how do you fix it without rewriting history?
You hit small-file and partition explosion: too many subfolders and tiny JSON/Parquet files. Azure Storage has limits on LIST throughput (~5,000 objects/second per prefix). Fix:
1. Re-batch historical data using a Spark notebook (Synapse) into coarser partitions: device_id/year_month_day (drop hour and minute).
2. Use Hive-style partitioning (device_id=123/year=2025/month=03) for partition pruning.
3. Compact files to at least 100 MB (Parquet with Snappy).
4. For future streams, use a tumbling window (5 min) in Stream Analytics or Spark Structured Streaming before writing to ADLS.
5. Run a scheduled compaction job to merge small files after 1 day (storage lifecycle rules can tier or delete blobs, but they do not merge them).
Never use minute-based folders for high-cardinality device IDs.
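The coarser Hive-style layout can be sketched as a small path builder (the account and container names in the default base path are placeholders):

```python
from datetime import date

def hive_partition_path(device_id, d,
                        base="abfss://raw@account.dfs.core.windows.net/iot"):
    """Build a Hive-style, day-grained partition path (no hour/minute folders).

    key=value folder names let engines such as Synapse Serverless prune
    partitions from the path alone, before reading any file contents.
    """
    return (f"{base}/device_id={device_id}"
            f"/year={d.year}/month={d.month:02d}/day={d.day:02d}")
```

Day-grained folders cap the directory count at devices × days instead of devices × days × 1,440 minutes, which is where the LIST throttling came from.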
3. The Accidental GDPR Purge Request
Scenario:
A customer submits a "right to be forgotten" request. Their data exists in: Cosmos DB (transactional), ADLS (raw Parquet), Synapse Dedicated SQL Pool (aggregates), and Power BI cache. You have no native cross-service cascading delete.
Question:
Design a compliant, auditable deletion strategy.
Azure does not support true cascading delete. Implement soft delete plus logical masking instead.
– Cosmos DB: Add an isDeleted boolean and a TTL on the container. Use the Change Feed to detect deletes.
– ADLS: Move the customer's files to a quarantine/ folder via ADF, then apply a retention policy (30 days) before hard delete.
– Synapse Dedicated Pool: Replace PII values with 'REDACTED' or NULL, and set isActive=0. Never DELETE rows; it breaks historical aggregates.
– Power BI: Refresh the dataset with a filter isActive=1.
– Audit: Log all operations to Azure Log Analytics with a correlation ID. Produce a "deletion certificate" using Microsoft Purview's lineage.
Key principle: "Delete the pointer, not always the byte."
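The masking step for the Dedicated Pool can be sketched in Python (the PII column set is illustrative; your classified columns would come from Purview):

```python
PII_FIELDS = {"name", "email", "ssn", "address"}  # illustrative column set

def soft_delete(record, pii_fields=PII_FIELDS):
    """Mask PII instead of issuing a hard DELETE, so aggregates stay intact.

    Returns a new dict with PII values replaced by 'REDACTED' and
    isActive cleared; surrogate keys and measures are left untouched.
    """
    masked = {k: ("REDACTED" if k in pii_fields else v)
              for k, v in record.items()}
    masked["isActive"] = 0
    return masked
```

Because the row survives with its key and measures, historical SUMs and COUNTs computed over the table are unchanged after the request is honored.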
4. Exfiltration of Sensitive Data via a Synapse Link
Scenario:
A junior engineer enables Synapse Link for Dataverse to replicate CRM data into ADLS. Two days later, you notice a service principal (from a decommissioned dev environment) reading 2 million rows of customer SSNs from the same container.
Question:
How do you retrospectively detect this, lock it down, and prevent recurrence?
– Detection: Query Azure Storage diagnostic logs (set to
StorageRead events) or Azure Data Explorer for AuthenticationType=ServicePrincipal. Filter by container, time range, and UserAgent.– Immediate lock: Disable the service principal in Azure AD â revoke its RBAC
Storage Blob Data Reader role â enable firewall + private endpoint for the storage account.– Prevention:
– Enforce Azure Policy to block public network access.
– Use Microsoft Purview to classify SSN/PCI columns and auto-apply sensitivity labels.
– Set Azure Synapse RBAC with
deny data read for untrusted principals.– Enable customer-managed key (CMK) with key rotation.
– Remediation: Rotate all keys, force regenerate SAS tokens, and notify compliance within 72 hours (GDPR breach window).
5. ADF Pipeline Slamming a Source OLTP Database
Scenario:
Your Azure Data Factory (ADF) copy activity runs every 15 minutes, using SELECT * FROM Orders with no filter. The on-premises SQL Server (Standard edition) starts timing out for the POS system during peak hours.
Question:
How do you reduce load without sacrificing near real-time?
Switch from full table scans to incremental loads with watermarking.
– Add a LastModified column in the source (if that's not possible, use CDC or Change Tracking in SQL Server).
– In ADF, store the last high watermark in Azure SQL DB or Blob Storage (watermark.txt).
– Use query pushdown: WHERE LastModified > '@{watermark}' AND LastModified <= '@{current_utc}'.
– Throttle parallel copy (set DIUs explicitly, e.g., 2, rather than Auto).
– Stage via Blob Storage to avoid long-running transactions on the source.
– For legacy databases without a timestamp column, implement row versioning or a staging table populated by AFTER triggers.
– Final step: Monitor ADF's copy duration and the source's resource consumption via Log Analytics; alert above 70% utilization.
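The bounded pushdown query can be assembled as below (a sketch; in real ADF you would use pipeline expressions and parameterized queries rather than Python string interpolation):

```python
def incremental_query(table, watermark, current_utc, ts_col="LastModified"):
    """Build the bounded incremental-load query ADF pushes to the source.

    The closed upper bound (current_utc) matters: rows modified while the
    copy is running fall into the next window instead of being skipped
    when the watermark advances.
    """
    return (f"SELECT * FROM {table} "
            f"WHERE {ts_col} > '{watermark}' "
            f"AND {ts_col} <= '{current_utc}'")
```

After a successful copy, the stored watermark is overwritten with current_utc, so each run scans only one 15-minute slice instead of the whole Orders table.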
6. Slowly Changing Dimension (SCD) Type 2 in Synapse Spark
Scenario:
You need to track history of customer addresses in a gold layer table. The source system sends full daily extracts (no CDC). A naive overwrite would lose history.
Question:
Explain your upsert logic using Spark (Synapse) without a MERGE statement.
Use Delta Lake's merge API (which Spark supports via the Delta Lake library) or implement a four-step approach:
1. Read the existing gold table and the new source extract.
2. Identify changes (e.g., an address change) by comparing a hash of all business attributes, excluding surrogate keys.
3. Expire old rows: set is_current = 0 and valid_to = current_date for changed records.
4. Insert new rows with valid_from = current_date, valid_to = '9999-12-31', is_current = 1.
For performance, bucket by customer_id (e.g., 16 buckets) and repartition(200) before writes. Store as Parquet-backed Delta to enable time travel (VERSION AS OF). Never overwrite the entire table. Run spark.sql("OPTIMIZE table ZORDER BY (customer_id)") weekly to keep files compacted.
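The hash-compare steps above can be sketched in plain Python (the column names and the sha256-over-joined-attributes scheme are illustrative; in Spark you would express the same logic over DataFrames):

```python
import hashlib
from datetime import date

def attr_hash(row, business_cols):
    """Hash the business attributes so a change is detected in one compare."""
    payload = "|".join(str(row[c]) for c in business_cols)
    return hashlib.sha256(payload.encode()).hexdigest()

def scd2_apply(current_rows, incoming_rows, key="customer_id",
               business_cols=("address",), today=None):
    """Expire changed rows and insert new versions (SCD Type 2).

    `current_rows` holds the gold table's is_current records; `incoming_rows`
    is the full daily extract. Returns the new row set, history preserved.
    """
    today = today or date.today().isoformat()
    out = []
    current = {r[key]: r for r in current_rows}
    for new in incoming_rows:
        old = current.pop(new[key], None)
        if old and attr_hash(old, business_cols) == attr_hash(new, business_cols):
            out.append(old)  # unchanged: keep the current version
            continue
        if old:  # changed: expire the old version
            out.append({**old, "is_current": 0, "valid_to": today})
        out.append({**new, "is_current": 1,
                    "valid_from": today, "valid_to": "9999-12-31"})
    out.extend(current.values())  # keys absent from the extract stay current
    return out
```

The single attribute hash avoids a column-by-column comparison across wide tables, which is exactly why full daily extracts without CDC remain workable.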
7. Cross-Region Disaster Recovery for Event Hubs
Scenario:
You process 50k events/second from IoT devices. Azure Event Hubs is deployed in East US. The region goes down for 4 hours. Your SLA requires < 15 minutes of data loss.
Question:
Design a DR strategy without rebuilding the consumer group.
Use Event Hubs Geo-DR (alias) with a paired region (e.g., West US).
– Enable the Geo-DR alias: it replicates metadata (namespace entities such as consumer groups) but not the events themselves. To meet the <15-minute loss target:
– Mirror-maker pattern: use Azure Functions or Stream Analytics to forward events asynchronously to a secondary Event Hub in West US.
– Maintain checkpoints in geo-redundant storage (RA-GZRS) so consumers can resume from the last offset.
– For production, deploy active-passive consumers: the primary reads from East US; the secondary sits idle. On failover, point the connection string at the alias and consumers resume from the mirrored checkpoint.
– Data loss: the acceptable window is 15 minutes; use idempotent writes to ADLS so late duplicates are deduplicated by event_id.
– Test failover quarterly with a planned DR drill (initiate a failover against a test namespace).
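The idempotent-sink deduplication by event_id can be sketched as follows (the seen_ids set stands in for a real checkpointed dedupe store):

```python
def dedupe_events(events, seen_ids=None):
    """Drop duplicates replayed after a DR failover before writing to the sink.

    `seen_ids` persists across batches (in practice a durable store keyed
    by event_id); here it is a plain set for illustration. Returns the
    unique events plus the updated id set.
    """
    seen_ids = seen_ids if seen_ids is not None else set()
    unique = []
    for ev in events:
        if ev["event_id"] not in seen_ids:
            seen_ids.add(ev["event_id"])
            unique.append(ev)
    return unique, seen_ids
```

With this in place, a failover that replays an overlapping window of events costs nothing but wasted reads: the sink never sees the same event_id twice.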
8. Synapse Serverless SQL Timing Out on Large File Sets
Scenario:
You query a 5 TB folder with 50,000 small JSON files using OPENROWSET(BULK...). The query times out after 30 minutes, even with a filename() filter.
Question:
How do you restructure the data or the query to succeed?
Serverless SQL is not designed for tiny files. Fixes:
1. File compaction: run a Spark job to merge the data into 200 MB–1 GB Parquet files.
2. Partition elimination: use WHERE filepath() LIKE '%/year=2025/month=03/%' to prune drastically.
3. Materialize with CETAS: use CREATE EXTERNAL TABLE AS SELECT to persist the converted Parquet once, then query that table (note that RESULT_SET_CACHING is a Dedicated SQL Pool feature, not a Serverless one).
4. Use OPTION (MAXDOP 1) to avoid memory pressure.
5. External metadata: create an external table over the folder and filter on file-level metadata (e.g., modification time) so unchanged files are skipped.
6. Fallback: move critical queries to a Dedicated SQL Pool, which handles the workload better via distributions.
Best practice: never use Serverless for heavy ETL; use it for ad-hoc exploration of well-partitioned Parquet.
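Why partition elimination helps can be simulated in Python: only paths matching the predicate are ever opened (the file paths here are hypothetical):

```python
def prune_paths(paths, year, month):
    """Simulate the filepath()-style pruning the WHERE clause performs.

    Only files under the matching year/month partition folders survive;
    everything else is eliminated before any I/O happens.
    """
    needle = f"/year={year}/month={month:02d}/"
    return [p for p in paths if needle in p]
```

On a folder with 50,000 files but only a few hundred in the target month, this turns the scan from "open everything" into "open a few hundred", which is the difference between a timeout and a sub-minute query.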
9. Pipeline Dependency Hell with ADF and Logic Apps
Scenario:
You have 40 ADF pipelines: Sales → Finance → Inventory → Reporting. A Logic App triggers on Blob creation, which starts another pipeline. One failure at 2 AM causes 8 dependent pipelines to hang. No one knows the root pipeline.
Question:
Implement observability and self-healing retries.
– Observability:
– Add a
run_id to every pipeline via pipeline().runId. Pass it as a parameter to downstream pipelines.– Write each pipelineâs start/end/status to Azure Monitor with custom dimensions (
source_system, depends_on).– Use ADFâs built-in dependency graph (in Monitor hub) to visualize chains.
– Self-healing:
– Set each activity retry = 3, retry interval = 30 sec, exponential backoff.
– For transient failures, use web activity to call a retry endpoint.
– Implement a control table in Azure SQL DB:
pipeline_dependencies (parent_run_id, child_run_id, status). A separate ADF âorchestratorâ checks every 5 minutes for hung child runs and re-triggers them.– Use failure webhook to send alert to Teams + create Azure DevOps bug.
– Avoid Logic Apps for orchestration; prefer ADFâs
Execute Pipeline activity with waitOnCompletion = true.
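The orchestrator's hung-run check against the control table can be sketched as follows (the 30-minute timeout, the started_at column, and the status values are assumptions for illustration):

```python
from datetime import datetime, timedelta

def find_hung_runs(dependency_rows, now, timeout_minutes=30):
    """Scan the control table for child runs stuck 'InProgress' too long.

    Each row mirrors the pipeline_dependencies table (parent_run_id,
    child_run_id, status, started_at). Returns the child run IDs the
    orchestrator should re-trigger.
    """
    cutoff = now - timedelta(minutes=timeout_minutes)
    return [r["child_run_id"] for r in dependency_rows
            if r["status"] == "InProgress" and r["started_at"] < cutoff]
```

Because every row carries the parent_run_id, the same table also answers the 2 AM question of which root pipeline started the cascade.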
10. Cost Explosion from Unbounded Synapse Spark Sessions
Scenario:
Your data scientists spin up 30 Synapse Spark pools (each with 3 nodes) for ad-hoc notebooks. They leave sessions running over the weekend. Your monthly Azure bill doubles.
Question:
How do you enforce governance without blocking innovation?
– Auto-pause & auto-scale: Set
auto-pause to 10 minutes idle, min nodes = 3, max nodes = 10.– RBAC + policies:
– Create a custom role
Spark Job Submitter that cannot create pools (only use existing).– Enforce Azure Policy: âDeny creation of Spark pools with node count > 10â and âRequire tag
owner and cost-centerâ.– Budget alerts: Use Azure Cost Management with a monthly budget of $500 for Synapse, trigger at 80% to send email.
– Automated shutdown: Azure Function that runs every hour, queries
spark_session via Synapse REST API, kills sessions older than 4 hours (except those tagged allow_long_running=true).– Education: Provide a shared pool named
analytics_small (3 nodes) for ad-hoc work. Force notebook to attach only to approved pools via %configure header check.
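The hourly cleanup decision can be sketched as below (the session field names are illustrative; the real Synapse REST API response shape differs):

```python
from datetime import datetime, timedelta, timezone

def sessions_to_kill(sessions, now=None, max_age_hours=4):
    """Pick Spark sessions the hourly cleanup Function should terminate.

    Each session dict mirrors what a session-listing API might return.
    Sessions tagged allow_long_running="true" are exempt, so legitimate
    multi-hour jobs are never interrupted.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [s["session_id"] for s in sessions
            if s["started_at"] < cutoff
            and s.get("tags", {}).get("allow_long_running") != "true"]
```

Keeping the selection logic separate from the API calls makes the policy easy to test and to tune (e.g., lowering max_age_hours for weekends).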
Final Authority Check
If you can explain why you chose Event Hubs Geo-DR over active replication, or when to use a Serverless vs. a Dedicated pool, you're not just an Azure Data Engineer. You're the person who gets called at 2 AM and fixes it without breaking the bank.
Want more? In the next post, I'll break down cost-optimization patterns for Azure Purview and real-time CDC from Oracle to Fabric. Subscribe below.

Cybersecurity Architect | Cloud-Native Defense | AI/ML Security | DevSecOps
With over 23 years of experience in cybersecurity, I specialize in building resilient, zero-trust digital ecosystems across multi-cloud (AWS, Azure, GCP) and Kubernetes (EKS, AKS, GKE) environments. My journey began in network security (firewalls, IDS/IPS) and expanded into Linux/Windows hardening, IAM, and DevSecOps automation using Terraform, GitLab CI/CD, and policy-as-code tools like OPA and Checkov.
Today, my focus is on securing AI/ML adoption through MLSecOps, protecting models from adversarial attacks with tools like Robust Intelligence and Microsoft Counterfit. I integrate AISecOps for threat detection (Darktrace, Microsoft Security Copilot) and automate incident response with forensics-driven workflows (Elastic SIEM, TheHive).
Whether it's hardening cloud-native stacks, embedding security into CI/CD pipelines, or safeguarding AI systems, I bridge the gap between security and innovation, ensuring defense scales with speed.
Let's connect and discuss the future of secure, intelligent infrastructure.