SSIS — SA Migration Guide

Purpose: Give a Solution Architect enough depth to assess an SSIS estate, understand its moving parts, and map a migration path to Databricks.

This is not a developer guide. You won't be building SSIS packages. You will be walking customer sites, reviewing architecture diagrams, asking the right questions, and scoping what it takes to move to a modern lakehouse platform.


Architecture Diagrams

SSIS Platform Architecture

How SSIS fits into the broader Microsoft data stack — from developer tooling through deployment and runtime execution.

flowchart TD
    subgraph DEV["Developer Layer"]
        SSDT["SSDT / Visual Studio\nBuild & edit packages visually"]
        PROJ[".dtsproj Project\nPackages + parameters + connection managers"]
        SSDT -- saves --> PROJ
    end

    subgraph DEPLOY["Deployment Models"]
        PROJMODEL["Project Deployment Model\nDeploy entire project to SSIS Catalog\n(recommended, post-2012)"]
        PKGMODEL["Package Deployment Model\nDeploy individual packages\n(legacy, pre-2012)"]
    end

    PROJ -- deploy --> PROJMODEL
    PROJ -- deploy --> PKGMODEL

    subgraph CATALOG["SSIS Catalog (SSISDB)"]
        SSISDB[("SSISDB\nSQL Server database\nStores packages, parameters,\nenvironments, execution history")]
        ENVS["Environments\nDEV / QA / PROD\nVariable overrides per env"]
        SSISDB -- references --> ENVS
    end

    PROJMODEL -- stored in --> CATALOG

    subgraph RUNTIME["Runtime / Execution"]
        AGENT["SQL Server Agent\nSchedules & triggers packages"]
        DTEXEC["dtexec.exe\nPackage execution engine"]
        AGENT -- launches --> DTEXEC
        ADFIR["Azure-SSIS IR\n(Azure Data Factory)\nCloud lift-and-shift runtime"]
    end

    CATALOG -- executed by --> RUNTIME

    subgraph OPS["Operations & Monitoring"]
        SSMS["SSMS / SSIS Dashboard\nExecution history, row counts, errors"]
        REPORTS["SSISDB Reports\nBuilt-in SQL Server Reporting"]
    end

    RUNTIME -- logs to --> CATALOG
    CATALOG -- viewed via --> OPS

SSIS as ETL — Data Flow Between Systems

How SSIS sits between source systems and targets in a typical Microsoft-centric enterprise pipeline.

flowchart LR
    subgraph SOURCES["Source Systems"]
        SQLSRV[("SQL Server\n/ Azure SQL")]
        ORACLE[("Oracle / DB2\n/ Other RDBMS")]
        FILES["Flat Files\nCSV / Fixed-width / Excel"]
        SHAREPOINT["SharePoint\n/ Web Services"]
        CLOUD["Cloud Sources\nBlob Storage / REST APIs"]
    end

    subgraph SSIS_LAYER["SSIS ETL Layer"]
        direction TB
        CF["Control Flow\nOrchestrates execution order\nExecute Package · For Loop · Foreach\nScript Task · SQL Task · FTP Task"]
        DF["Data Flow Task\nIn-memory pipeline engine\nSource → Transform → Destination"]
        STAGING[("Staging DB / Flat Files\nIntermediate storage\nbetween pipeline stages")]
        PARAMS["Parameters & Variables\nRuntime configuration\nConnection strings, date ranges"]

        CF -- contains --> DF
        DF -- writes interim results --> STAGING
        STAGING --> DF
        PARAMS -. configures .-> CF
    end

    subgraph TARGETS["Target Systems"]
        DW[("SQL Server DW\n/ Azure Synapse")]
        DATALAKE["Azure Data Lake\nADLS Gen2 / Blob"]
        RPTDB[("Reporting DB\n/ SSAS Cube")]
        DOWNSTREAM["Downstream Apps\nvia file or DB write"]
    end

    subgraph ORCH["Orchestration"]
        SQLAGENT["SQL Server Agent\nJob schedule + step chain"]
        ADF["Azure Data Factory\nPipeline triggers (cloud)"]
    end

    SOURCES --> DF
    DF --> TARGETS
    SQLAGENT -. schedules .-> CF
    ADF -. triggers .-> CF

Sections

  1. Ecosystem Overview
  2. Packages — The Core Building Block
  3. Data Formats and Schema
  4. Parallelism and Scaling Model
  5. Project Structure and Version Control
  6. Orchestration: SQL Server Agent and ADF
  7. Metadata, Lineage, and Impact Analysis
  8. Data Quality
  9. SSIS File Formats Reference
  10. Migration Assessment and Artifact Inventory
  11. Migration Mapping to Databricks

1. Ecosystem Overview

What Is SSIS?

SQL Server Integration Services (SSIS) is Microsoft's enterprise ETL and data integration platform. It has been bundled with SQL Server since 2005 and is the dominant ETL tool in Microsoft-centric enterprise environments — banks, retail, healthcare, government, and manufacturing shops that standardized on the SQL Server stack.

SSIS is a visual, task-based workflow engine. Developers build packages in Visual Studio (using SQL Server Data Tools, SSDT) and deploy them to a SQL Server instance that stores, schedules, and monitors execution. It is deeply integrated with SQL Server, Active Directory, and the broader Microsoft platform.

Unlike cloud-native ETL tools, SSIS is:

  • On-premises first — designed to run on Windows servers; cloud use requires Azure-SSIS IR (a managed VM cluster in Azure Data Factory), which is lift-and-shift, not native cloud
  • Single-node by default — SSIS does not distribute processing across a cluster; scale comes from the SQL Server host machine
  • Proprietary packaging — packages are .dtsx files (XML), which are readable but not portable to other platforms without rewriting
  • Tightly coupled to SQL Server — many patterns (staging tables, stored procedure calls, linked servers) assume SQL Server as the hub

The SSIS Product Ecosystem

Customers with large SSIS estates often also have adjacent Microsoft data tools. Understanding the full stack matters for scoping.

Product What It Does Migration Relevance
SSIS ETL package execution engine High — all transformation logic lives here
SSISDB (SSIS Catalog) SQL Server database storing packages, environments, execution history High — inventory and execution logs live here
SQL Server Agent Windows-based job scheduler; triggers SSIS packages High — all scheduling to be replaced
SSDT / Visual Studio IDE for building and debugging SSIS packages Medium — development tooling only
SSMS SQL Server Management Studio; used to deploy, run, and monitor packages Medium — operational tooling
SSIS Scale Out Multi-server parallel execution feature (SQL Server 2017+) Low — rarely deployed; note if present
Azure-SSIS IR Azure Data Factory managed runtime to run SSIS packages in cloud Medium — customers on Azure may use this as a migration stepping stone
Data Quality Services (DQS) Standalone Microsoft data quality cleansing/matching tool Low-Medium — note if actively used; requires separate migration
Master Data Services (MDS) Reference data and MDM tool Low — separate product; flag if in scope

SA Tip: Ask whether the customer is on Azure-SSIS IR already. If yes, they've done a lift-and-shift to the cloud but haven't modernized — this is actually a strong migration entry point because they're already paying cloud costs and feeling the limitations of running SSIS on a managed VM cluster.

Why Customers Want to Migrate

Driver What It Means for the Engagement
Cloud strategy Moving off on-premises SQL Server eliminates the SSIS runtime — packages need a new home
SQL Server license cost SQL Server Enterprise (required for full SSIS capability) is expensive; migration reduces licensing
Performance limits SSIS runs on a single node; large datasets hit memory and CPU ceilings that Spark doesn't have
Developer productivity SSIS development in Visual Studio is slow; Databricks notebooks and DLT pipelines are faster to iterate
Vendor ecosystem Customers moving to Azure broadly want cloud-native tools, not a VM running SSIS packages
Talent New data engineers don't know SSIS; they know Python, SQL, and Spark

SA Tip: The most common trigger is a SQL Server version upgrade or cloud migration project that surfaces SSIS as a dependency. The customer wasn't planning to migrate ETL — they discovered they have to. This means they often haven't assessed scope yet. The first conversation is often about scoping, not selling.

Key Discovery Questions

Before scoping a migration, ask:

  1. How many packages are in active production use? (vs. total packages in SSISDB — there will be significant dead code)
  2. Are packages deployed in Project Deployment Model or the legacy Package Deployment Model?
  3. What is the SQL Server version and is the customer on Azure-SSIS IR or on-premises?
  4. What is the SQL Server Agent job structure — how many jobs, how many steps per job?
  5. Are there Script Tasks or Script Components with custom C# or VB.NET code? How many?
  6. What are the primary source and target systems? (SQL Server tables, flat files, Oracle, SAP, cloud storage?)
  7. Are packages organized in Visual Studio solutions with Git, or are they stored loose in the file system or SSISDB only?
  8. Is Data Quality Services (DQS) in use, or are quality checks embedded in packages?
  9. What are the SLAs for critical nightly batch jobs?
  10. Are there SSIS packages calling stored procedures with heavy business logic? How much logic is in SQL vs. SSIS?

2. Packages — The Core Building Block

The Package

In SSIS, the package is the fundamental unit of work — equivalent to a pipeline or job in modern ETL tools. A package contains two main design surfaces:

  • Control Flow — the orchestration layer; defines what tasks run, in what order, with what conditions
  • Data Flow — the data pipeline layer; defines how data moves from source to destination through transformations

A package is saved as a .dtsx file — an XML document that encodes every task, component, connection, and expression in the package.

A single SSIS project can have dozens to hundreds of packages, often organized by source system, subject area, or processing stage.

Control Flow

The Control Flow is the top-level canvas in SSIS. It defines the workflow logic — sequencing of tasks, looping over files or records, conditional branching, and error handling.

Core Control Flow task types:

Task Type What It Does Databricks Equivalent
Execute SQL Task Runs a SQL statement against a connection spark.sql() / notebook cell
Data Flow Task Runs a data pipeline (source → transform → destination) DLT pipeline / notebook ETL code
Execute Package Task Calls another SSIS package Databricks Workflow task dependency
Script Task Runs custom C# or VB.NET code Python notebook cell / utility function
For Loop Container Iterates N times Python for loop / Workflow loop
Foreach Loop Container Iterates over a collection (files, rows, variables) Python list comprehension / Auto Loader
Sequence Container Groups tasks for organization or error handling Task group in Workflow
File System Task Copies, moves, or deletes files DBFS/ADLS file operations in Python
FTP Task Transfers files via FTP Python ftplib / partner connector
Send Mail Task Sends email notification Databricks alert / webhook notification

Precedence Constraints connect tasks in the Control Flow and define when a downstream task runs: on success, on failure, or on completion of the upstream task. They can include expression-based conditions.

Data Flow

The Data Flow Task is where actual data transformation happens. Inside a Data Flow, you build a pipeline of components — sources, transformations, and destinations — connected by data paths.

Core Data Flow component categories:

Category Examples What They Do
Sources OLE DB Source, Flat File Source, Excel Source, ADO.NET Source Read data from connections
Row Transforms Derived Column, Data Conversion, Character Map Field-level transformations
Split / Route Conditional Split, Multicast Filter rows or copy stream to multiple outputs
Join / Lookup Merge Join, Lookup, Cache Transform Combining datasets, enrichment
Aggregate Aggregate, Sort Group-by aggregations, ordering
Row counting / sampling Row Count, Row Sampling, Percentage Sampling Metrics collection, testing
Fuzzy matching Fuzzy Lookup, Fuzzy Grouping Probabilistic matching and deduplication
Script Component Script Component (Source / Transform / Destination) Custom C# or VB.NET transformation logic
Destinations OLE DB Destination, Flat File Destination, SQL Server Destination Write data to targets

SA Tip: Script Components are the hardest part of any SSIS migration. They contain arbitrary C# or VB.NET business logic that has no direct equivalent — each one requires a developer to read, understand, and rewrite the logic in Python or Spark. Count them early; they are the primary driver of migration effort.

Variables and Expressions

SSIS packages use Variables to store runtime values — counters, file paths, date ranges, row counts, status flags. Variables can be scoped to the package or to a specific container.

Expressions are a lightweight scripting language embedded in SSIS that evaluates at runtime. They are used to:

  • Build dynamic file paths ("C:\\data\\" + @[User::RunDate] + "_customers.csv")
  • Set conditional precedence constraints
  • Parameterize task properties

Migration relevance: Variables and expressions containing business logic (date arithmetic, conditional paths) must be extracted and rewritten in Python or SQL during migration. They are rarely documented outside the package XML.
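
A minimal sketch of how a path-building expression like the one above translates, assuming a Databricks notebook (where spark is predefined) and an illustrative Volumes path:

from datetime import date

# SSIS expression: "C:\\data\\" + @[User::RunDate] + "_customers.csv"
# In Databricks, the run date arrives as a Workflow job parameter or is derived in code.
run_date = date.today().strftime("%Y%m%d")          # replaces @[User::RunDate]
input_path = f"/Volumes/raw/finance/{run_date}_customers.csv"

df = spark.read.csv(input_path, header=True)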

Connection Managers

Connection Managers define how packages connect to external systems — database connection strings, file paths, FTP servers, web service URLs. They are defined at the package or project level and referenced by source/destination components.

In the Project Deployment Model, connection managers can be overridden per environment (dev/qa/prod) using environment variable references — this is the SSIS approach to configuration management across environments.

SA Tip: Ask whether connection strings are stored directly in packages, in project-level connection managers, or parameterized via environment references. Hard-coded connection strings in individual packages are a common technical debt pattern — each package has its own idea of where the database is, which complicates any infrastructure change and creates migration risk.


3. Data Formats and Schema

How SSIS Describes Data

SSIS uses metadata — not external schema files — to describe data moving through pipelines. Column names, data types, and lengths are embedded directly in the .dtsx XML for each Data Flow component. This means:

  • There is no central schema registry in SSIS (unlike Ab Initio's DML files)
  • Schema is defined per-component, per-package
  • Changes to upstream schemas break downstream components — SSIS validates component metadata at design time and at package start, but cannot adapt to schema drift in live data

SSIS native data types:

SSIS Type Maps To Databricks Equivalent
DT_STR ANSI string (fixed length) StringType
DT_WSTR Unicode string (fixed length) StringType
DT_I4 32-bit integer IntegerType
DT_I8 64-bit integer LongType
DT_R4 / DT_R8 Single / Double float FloatType / DoubleType
DT_NUMERIC Decimal with precision/scale DecimalType(p,s)
DT_DATE / DT_DBTIMESTAMP Date / Datetime DateType / TimestampType
DT_BOOL Boolean BooleanType
DT_BYTES Binary BinaryType
DT_IMAGE Variable-length binary (BLOB) BinaryType

SA Tip: SSIS's DT_STR vs. DT_WSTR (ANSI vs. Unicode) mismatch is a classic source of silent data truncation and corruption. In migration, assume all strings should be Unicode (StringType in Spark) and flag any packages that mix string types — these are latent data quality risks.
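
Where this matters in practice is schema generation: if you parse .dtsx metadata to produce Spark schemas, a small lookup table implements the mapping above. A sketch — the DT_NUMERIC precision/scale placeholder is an assumption; real values live in the component metadata:

from pyspark.sql.types import (
    BinaryType, BooleanType, DateType, DecimalType, DoubleType,
    FloatType, IntegerType, LongType, StringType, TimestampType,
)

# SSIS pipeline type -> Spark type, following the table above.
# Both DT_STR and DT_WSTR are promoted to Unicode StringType.
SSIS_TO_SPARK = {
    "DT_STR": StringType(),
    "DT_WSTR": StringType(),
    "DT_I4": IntegerType(),
    "DT_I8": LongType(),
    "DT_R4": FloatType(),
    "DT_R8": DoubleType(),
    "DT_NUMERIC": DecimalType(38, 18),   # placeholder; read precision/scale from the .dtsx
    "DT_DATE": DateType(),
    "DT_DBTIMESTAMP": TimestampType(),
    "DT_BOOL": BooleanType(),
    "DT_BYTES": BinaryType(),
    "DT_IMAGE": BinaryType(),
}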

File Formats SSIS Reads and Writes

Format How SSIS Uses It Migration Note
Delimited CSV Flat File Source/Destination Read directly in Databricks with spark.read.csv()
Fixed-width flat file Flat File Source with column positions Requires format definition; parse with spark.read.text() + regex
Excel (.xls / .xlsx) Excel Source — requires 32-bit ACE driver Problematic — use openpyxl in Python or convert to CSV first
XML XML Source — maps XML elements to rows Parse with spark.read.xml() (Databricks native) or lxml
JSON Script Component (SSIS has no native JSON source) spark.read.json() — straightforward
ORC / Parquet Not natively supported N/A — SSIS estates don't typically produce these

SA Tip: Excel as a source is a significant risk flag. The SSIS ACE/Jet driver for Excel is 32-bit only, brittle, and a frequent source of production failures. Customers relying on Excel as a source system often have informal data entry processes that will need to be redesigned as part of any migration.
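
For fixed-width files specifically, the spark.read.text() approach mentioned in the table looks like the sketch below — column offsets here are hypothetical and would come from the package's Flat File Connection Manager:

from pyspark.sql.functions import col, substring, trim

raw = spark.read.text("/Volumes/raw/finance/customers_fixed.txt")  # illustrative path

customers = raw.select(
    trim(substring(col("value"), 1, 10)).alias("customer_id"),     # positions 1-10
    trim(substring(col("value"), 11, 40)).alias("customer_name"),  # positions 11-50
    trim(substring(col("value"), 51, 12)).alias("balance"),        # positions 51-62
)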


4. Parallelism and Scaling Model

How SSIS Achieves Performance

SSIS is not a distributed processing engine. It runs on a single machine and scales vertically — by adding more CPU, RAM, and faster storage to the SSIS server. This is the most important architectural difference to convey to a customer comparing SSIS performance to Databricks.

Within a single package, SSIS achieves limited parallelism through:

  • Multiple Data Flow Tasks running in parallel — the Control Flow can execute multiple tasks simultaneously if they have no dependencies
  • Data Flow pipeline streaming — within a Data Flow, components process rows in a pipeline fashion (source is reading while transformations are computing)
  • Synchronous vs. asynchronous transformations — blocking (asynchronous) transforms like Sort and Aggregate must consume all rows before emitting output; most others process rows as they arrive (synchronous, non-blocking)

SSIS Scale Out (Rare)

SQL Server 2017 introduced SSIS Scale Out — a master/worker architecture that distributes package execution across multiple worker servers. It is rarely deployed in the field because:

  • It requires significant infrastructure setup
  • It only parallelizes at the package level, not within a Data Flow
  • Most customers who need true distributed processing have moved to other tools

SA Tip: If a customer mentions SSIS Scale Out, they have an unusually large SSIS estate that has hit single-node limits. This is actually a good signal — they already know they need more than SSIS can offer and are primed for the Databricks pitch.

Buffer-Based Execution

SSIS Data Flow processes rows in memory buffers. The SSIS engine reads rows from the source into buffer memory, passes buffers through transformation components, and writes to the destination. Buffer size is configurable (DefaultBufferSize, DefaultBufferMaxRows).

This means:

  • Performance is heavily dependent on available RAM
  • Large transformations requiring full dataset access (Sort, Aggregate, Merge Join) can cause buffer spills to disk
  • There is no concept of distributed shuffle — everything happens on one machine

SSIS Scaling Factor Databricks Equivalent
Machine RAM (buffer size) Cluster memory across all workers
CPU cores on SSIS server Total vCores across Spark executors
Disk I/O for buffer spills Managed by Spark shuffle service
Parallelism via multiple packages Spark partitions within a single job

5. Project Structure and Version Control

Deployment Models

SSIS has two deployment models that determine how packages are stored, configured, and deployed:

Model Introduced How It Works Version Control
Project Deployment Model SQL Server 2012 Entire .ispac project file deployed to SSISDB Recommended — project stays in Git as .dtproj
Package Deployment Model SQL Server 2005 Individual .dtsx packages deployed to file system or MSDB Legacy — packages often not in any VCS

SA Tip: If the customer is on Package Deployment Model, assume version control is weak or absent. Packages in MSDB (the SQL Server system database) are particularly hard to inventory — you need to extract them via SSMS or T-SQL before you can even count them accurately.

Visual Studio Project Structure

When packages are managed in Visual Studio / SSDT, the project structure looks like:

Solution: EnterpriseETL.sln
 └── Project: Finance_ETL.dtproj
       ├── Connection Managers/
       │     ├── DW_Connection.conmgr
       │     └── Staging_Connection.conmgr
       ├── Project.params              ← project-level parameters
       ├── Packages/
       │     ├── Extract_Customers.dtsx
       │     ├── Transform_Accounts.dtsx
       │     └── Load_DW_Facts.dtsx
       └── Environments/              ← SSISDB environment references
             ├── DEV.xml
             └── PROD.xml
Artifact File Extension What It Contains
Package .dtsx A single ETL workflow (Control Flow + Data Flow)
Project file .dtproj Package list, project-level connection managers, parameters
Solution file .sln Visual Studio solution grouping one or more projects
Project parameters .params Shared parameters available to all packages in the project
Connection manager .conmgr Shared connection definition at project level
Deployment package .ispac Compiled, deployable version of a project

Version Control Reality

SSIS and Git have a complicated relationship. .dtsx files are XML, which means:

  • They are technically diffable, but the XML is verbose and auto-generated, making diffs hard to read
  • Visual Studio makes minor formatting changes on every save, creating noisy commits
  • Many teams give up on meaningful code review and treat the package as a binary

As a result, many SSIS estates have nominal Git tracking — packages are committed but PRs are never reviewed and history is not meaningful. The production state in SSISDB is often ahead of or different from what's in Git.

SA Tip: Before trusting the Git repository as the source of truth, ask: "When was the last time you did a code review on an SSIS package?" If the answer is uncertain, treat SSISDB as the authoritative source and export packages directly from there for inventory.


6. Orchestration: SQL Server Agent and ADF

SQL Server Agent

SQL Server Agent is the primary scheduler for SSIS in on-premises environments. It is a Windows service built into SQL Server that runs Jobs on a schedule. Each Job has one or more Steps, and each step can execute a different action — running an SSIS package, executing T-SQL, running a PowerShell script, and so on.

Key SQL Server Agent concepts:

Concept Description Databricks Equivalent
Job A named scheduled task with one or more steps Databricks Workflow
Step A single action within a job (e.g., run SSIS package) Workflow Task
Schedule Time-based trigger (cron-style or recurring) Workflow schedule
Alert Trigger on SQL Server event or performance condition Databricks alert / webhook
Operator Named recipient for job failure notifications Notification contact in Workflow
Job History Run log: success/failure, duration, messages Workflow run history

A typical batch pipeline in SSIS looks like a SQL Server Agent job with steps in sequence:

Job: NIGHTLY_FINANCE_LOAD (runs at 01:00)
  Step 1: Run Extract_Source_Data.dtsx   (success → Step 2, fail → notify)
  Step 2: Run Validate_Records.dtsx      (success → Step 3, fail → notify)
  Step 3: Run Load_DW_Facts.dtsx         (success → Step 4, fail → notify)
  Step 4: Run Update_Audit_Table.sql     (success → end)
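
As a rough sketch, the same chain expressed as a Databricks Workflow using the databricks-sdk Python client — job name, notebook paths, and timezone are illustrative, and cluster configuration is omitted for brevity:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="NIGHTLY_FINANCE_LOAD",
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 1 * * ?", timezone_id="UTC"),  # 01:00 daily
    tasks=[
        jobs.Task(task_key="extract",
                  notebook_task=jobs.NotebookTask(notebook_path="/ETL/extract_source_data")),
        jobs.Task(task_key="validate",
                  depends_on=[jobs.TaskDependency(task_key="extract")],
                  notebook_task=jobs.NotebookTask(notebook_path="/ETL/validate_records")),
        jobs.Task(task_key="load_facts",
                  depends_on=[jobs.TaskDependency(task_key="validate")],
                  notebook_task=jobs.NotebookTask(notebook_path="/ETL/load_dw_facts")),
        jobs.Task(task_key="update_audit",
                  depends_on=[jobs.TaskDependency(task_key="load_facts")],
                  notebook_task=jobs.NotebookTask(notebook_path="/ETL/update_audit_table")),
    ],
)

The depends_on chain reproduces the Agent job's "success → next step" semantics; failure notifications map to the Workflow's notification settings rather than per-step branches.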

Azure Data Factory (Cloud Orchestration)

Customers who have moved SSIS to Azure commonly use Azure Data Factory (ADF) to trigger packages on Azure-SSIS IR. ADF acts as the orchestration layer — its pipelines call Execute SSIS Package activities that run packages on the IR.

ADF Concept SSIS Equivalent Databricks Equivalent
Pipeline SQL Server Agent Job Databricks Workflow
Activity Agent Job Step Workflow Task
Trigger Agent Schedule Workflow Schedule
Execute SSIS Package activity Agent step running dtexec Databricks notebook task
Linked Service Connection Manager Databricks secret-backed connection
Integration Runtime SSIS server / Azure-SSIS IR Databricks cluster

SA Tip: Customers on ADF + Azure-SSIS IR are in a half-migrated state — they've moved the scheduling layer to cloud-native but kept the ETL logic in SSIS packages. This is a compelling conversation: they're already paying for ADF and IR; migrating the packages to Databricks notebooks replaces the IR cost and gives them a genuinely cloud-native, scalable platform.

External Schedulers

Some large enterprises use Control-M, Autosys, or IBM Workload Scheduler to trigger SQL Server Agent jobs or ADF pipelines. The same principle from Ab Initio applies here: if an external scheduler is in play, the migration involves two layers — replacing SSIS with Databricks, and replacing or integrating the external scheduler trigger.


7. Metadata, Lineage, and Impact Analysis

SSISDB as the Inventory Source

The SSIS Catalog (SSISDB) is the most reliable inventory source for deployed packages. It stores:

  • All deployed projects and packages (binary .ispac and individual .dtsx)
  • Execution history — every package run, its start/end time, status, and row counts
  • Environment references — which environments exist and which variables they override
  • Validation reports — package validation results at deploy time

Useful SSISDB queries for inventory:

-- List all deployed packages
SELECT p.name AS project_name, pkg.name AS package_name, pkg.description
FROM catalog.projects p
JOIN catalog.packages pkg ON p.project_id = pkg.project_id
ORDER BY p.name, pkg.name;

-- Find packages that have actually run in the last 90 days
SELECT package_name, COUNT(*) AS executions,
       MAX(start_time) AS last_run
FROM catalog.executions
WHERE start_time > DATEADD(day, -90, GETDATE())
GROUP BY package_name
ORDER BY last_run DESC;

-- Find packages that have NEVER run (candidates for decommission)
SELECT p.name AS project_name, pkg.name AS package_name
FROM catalog.packages pkg
JOIN catalog.projects p ON pkg.project_id = p.project_id
LEFT JOIN catalog.executions ex
       ON ex.package_name = pkg.name AND ex.project_name = p.name
WHERE ex.execution_id IS NULL;

SA Tip: The "never run" query is always revealing. Large SSIS estates commonly have 20–40% of packages that were built but never deployed or ran in production. Filtering these out before scoping migration saves significant effort and cost.

Lineage Limitations

SSIS has no native data lineage capability. The platform tracks job execution history (did the package succeed?) but does not track column-level or table-level data lineage (where did this data come from and where does it go?).

Approaches customers use to compensate:

Approach What It Captures Limitations
Package XML parsing Source/destination connections within each package Requires custom tooling; misses stored procedure logic
SQL Server Audit / Extended Events Which tables were read/written at runtime Infrastructure-level, not ETL-logic-level
Third-party lineage tools MANTA, WhereScape, and similar Expensive; results vary by package complexity
Manual documentation Spreadsheets maintained by the team Almost always out of date

SA Tip: Don't assume lineage documentation exists or is current. The most reliable approach is to parse the .dtsx XML directly — extract all OLE DB Sources (connection + SQL query), all OLE DB Destinations (table names), and all Execute SQL Tasks. This gives you a first-pass data flow map without needing any documentation from the customer.
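
A minimal sketch of that first-pass parse, assuming packages have been exported to a local folder. The element and attribute names follow the pipeline XML shown in the .dtsx reference (section 9); real packages may additionally need DTS namespace handling:

import glob
import xml.etree.ElementTree as ET

def extract_sql(dtsx_path):
    """Pull embedded SQL and table references out of one package."""
    found = []
    for prop in ET.parse(dtsx_path).iter("property"):
        if prop.get("name") in ("SqlCommand", "OpenRowset") and prop.text and prop.text.strip():
            found.append((prop.get("name"), prop.text.strip()))
    return found

for path in glob.glob("exported_packages/**/*.dtsx", recursive=True):
    for kind, value in extract_sql(path):
        print(f"{path}\t{kind}\t{value[:80]}")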

Impact Analysis Without EME

Unlike Ab Initio, SSIS has no built-in impact analysis tool. To understand dependencies:

  • Package dependencies: Search for Execute Package Task entries in .dtsx XML to find parent-child package relationships (see the sketch after this list)
  • Shared connection managers: Project-level .conmgr files are referenced by multiple packages — changing a connection manager affects all packages in the project
  • Shared parameters: Project-level .params files are consumed by multiple packages — parameter changes have wide scope
  • SQL object dependencies: sys.sql_expression_dependencies in SQL Server can show which stored procedures and views are called from Execute SQL Tasks
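
A rough sketch of the package-dependency scan from the first bullet — the regex is an assumption about how Execute Package Task serializes its target and should be verified against the customer's .dtsx files:

import glob
import re

deps = {}
for path in glob.glob("exported_packages/**/*.dtsx", recursive=True):
    with open(path, encoding="utf-8") as f:
        xml_text = f.read()
    # Execute Package Task references typically carry the child package name.
    children = set(re.findall(r"PackageName[^>]*>([^<]*\.dtsx)", xml_text))
    if children:
        deps[path] = sorted(children)

for parent, children in deps.items():
    print(parent, "->", ", ".join(children))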

8. Data Quality

Data Quality in SSIS Packages

SSIS has no dedicated "data quality layer" — quality checks are implemented as logic embedded directly in packages, using standard components in creative ways.

Common data quality patterns in SSIS:

Pattern SSIS Implementation Databricks Equivalent
Null / required field check Conditional Split on ISNULL([column]) DLT expect_or_drop()
Range / domain validation Derived Column + Conditional Split DLT expect() with constraint expression
Referential integrity check Lookup component (no match → error output) DLT expect() or left-anti join
Duplicate detection Sort + Aggregate to find duplicates DLT / Great Expectations
Fuzzy matching / deduplication Fuzzy Lookup, Fuzzy Grouping components Splink / custom similarity logic
Row count validation Row Count component + Execute SQL Task comparison DLT row count expectations
Checksum / hash validation Script Component computing hash Python hashlib in notebook
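
As a sketch, the first two patterns in the table expressed as Delta Live Tables expectations — table names and constraints are illustrative:

import dlt

@dlt.table(name="customers_clean")
@dlt.expect_or_drop("customer_id_not_null", "customer_id IS NOT NULL")   # null check: drop failing rows
@dlt.expect("balance_in_range", "balance BETWEEN 0 AND 1000000")         # range check: log violations only
def customers_clean():
    return spark.read.table("staging.customers_raw")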

The Error Output Pattern

Every SSIS Data Flow component has an optional Error Output — a second output port that catches rows that fail processing (type cast failure, truncation, constraint violation). Developers route error rows to:

  • A flat file for manual review
  • An error logging table in SQL Server
  • A separate "exception" processing package

Migration relevance: Error output routing is business logic, not infrastructure. During inventory, document every component that has an active Error Output wired to a destination — these represent exception handling workflows that must be preserved in the migrated pipeline.
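
One way to preserve that behavior in Databricks is an explicit split. A sketch of the quarantine pattern, with illustrative table names:

from pyspark.sql.functions import col

raw = spark.read.table("staging.accounts_raw")

# Rows that would have survived the SSIS Data Flow.
valid = raw.filter(
    col("account_id").isNotNull()
    & col("amount").cast("decimal(18,2)").isNotNull()   # replaces a cast-failure error output
)
# Everything else goes to a quarantine table instead of an error output port.
errors = raw.subtract(valid)

valid.write.mode("append").saveAsTable("silver.accounts")
errors.write.mode("append").saveAsTable("ops.accounts_quarantine")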

Data Quality Services (DQS)

Microsoft Data Quality Services is a separate SQL Server feature for knowledge-based data cleansing. It is rarely used in practice — most SSIS estates use inline Conditional Split / Lookup patterns instead of DQS. If a customer mentions DQS:

  • It requires a separate SQL Server DQS installation
  • Cleansing rules are stored in a DQS "knowledge base" — a proprietary store that must be re-implemented in Databricks (Great Expectations, custom Python, or a third-party DQ tool)


9. SSIS File Formats Reference

When you walk into a customer's SSIS environment, you will encounter a specific set of file types. Knowing what each one is and what it means for migration is essential for artifact inventory.


.dtsx — SSIS Package

The .dtsx file is the core artifact of every SSIS estate. It defines a single ETL package — both the Control Flow (workflow logic) and the Data Flow (data pipeline). It is an XML file, making it technically human-readable, but the XML is verbose and generated by Visual Studio.

Property Detail
Created by Visual Studio / SSDT — developers build packages visually
Stored in SSISDB (deployed), .dtproj project folder (source), file system (legacy package deployment)
Contains All tasks, components, variables, expressions, connection manager references, event handlers, and configurations for one package
Human-readable? Technically yes (XML), but verbose and not meant to be edited by hand
Migration target Each .dtsx is a migration unit — maps to a Databricks notebook, Python script, or DLT pipeline

Example snippet (OLE DB Source inside a Data Flow):

<component name="OLE DB Source" componentClassID="Microsoft.OLEDBSource">
  <properties>
    <property name="SqlCommand">
      SELECT customer_id, name, balance FROM dbo.customers WHERE active = 1
    </property>
  </properties>
  <connections>
    <connection connectionManagerID="{DW-Connection}" />
  </connections>
</component>

SA Tip: Parsing .dtsx XML is the most reliable way to inventory SSIS logic at scale — extract all SQL queries in SqlCommand properties and all table names in OpenRowset properties across all packages. Tools like BIML, custom PowerShell, or the open-source ssis-parser project can automate this. The count of distinct source tables and destination tables from this parse is your data asset inventory.


.dtproj — SSIS Project File

The .dtproj file is the Visual Studio project definition. It lists all packages in the project, project-level connection managers, parameters, and configuration references. It is the top-level organizing unit for a group of related packages.

Property Detail
Created by Visual Studio when a new Integration Services project is created
Stored in Source control alongside .dtsx files
Contains References to all .dtsx packages in the project, project parameters file, project-level connection managers
Human-readable? Yes — XML format
Migration target Maps to a Databricks project folder / Git repository; project structure informs Workflow organization

SA Tip: Count the number of .dtproj files to understand how many logical project boundaries exist. Large estates often have one project per subject area (Finance_ETL, HR_ETL, Operations_ETL) — these become natural migration wave boundaries.


.params — Project Parameters File

The .params file defines project-level parameters — named values that can be passed in at execution time to configure packages without modifying the package itself. Parameters include connection strings, file paths, run dates, batch IDs, and environment-specific values.

Property Detail
Created by Visual Studio — developers add parameters when packages need runtime configuration
Stored in Root of the SSDT project folder, alongside .dtproj
Contains Parameter definitions: name, data type, default value, whether it is sensitive (encrypted)
Human-readable? Yes — XML format
Migration target Maps to Databricks Workflow job parameters, Databricks secrets (for sensitive values), or Delta table configuration entries

Example snippet:

<SSIS:Parameter SSIS:Name="RunDate">
  <SSIS:Properties>
    <SSIS:Property SSIS:Name="DataType">7</SSIS:Property>
    <SSIS:Property SSIS:Name="Value">2026-01-01</SSIS:Property>
    <SSIS:Property SSIS:Name="Sensitive">0</SSIS:Property>
  </SSIS:Properties>
</SSIS:Parameter>

SA Tip: Parameters marked Sensitive="1" are encrypted in the SSISDB. You cannot read their values without appropriate SQL Server permissions. Identify these early — they represent credentials and secrets that need to be migrated to Databricks Secrets / Azure Key Vault with proper access controls.
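
The target pattern is straightforward — a sketch with a hypothetical secret scope and key, replacing a sensitive SSIS parameter with a Databricks secret lookup (dbutils is available in Databricks notebooks and jobs):

jdbc_password = dbutils.secrets.get(scope="finance-etl", key="dw-sql-password")

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://dw-host:1433;databaseName=DW")  # illustrative
    .option("dbtable", "dbo.customers")
    .option("user", "etl_service")
    .option("password", jdbc_password)
    .load()
)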


.conmgr — Project-Level Connection Manager

The .conmgr file defines a shared database or file connection that multiple packages in a project can reference. Project-level connection managers avoid repeating connection strings in every package.

Property Detail
Created by Visual Studio when developers promote a package-level connection to project scope
Stored in Connection Managers/ folder inside the SSDT project
Contains Connection type (OLE DB, ADO.NET, flat file, etc.), connection string, provider, authentication mode
Human-readable? Yes — XML format
Migration target Maps to Databricks cluster environment variables, secrets, or Unity Catalog external locations

SA Tip: Count distinct connection managers to understand how many distinct data systems the SSIS estate touches. Each unique Data Source in the OLE DB connection strings represents a separate source or target system — this is your integration surface map.


.ispac — SSIS Deployment Package

The .ispac file is the compiled, deployable artifact for a project — a ZIP archive containing all packages, parameters, and connection managers for a project. It is created by Visual Studio's Build process and deployed to SSISDB using SSMS or the SSIS Deployment Wizard.

Property Detail
Created by Visual Studio Build output
Stored in Build output directory; CI/CD artifact store; sometimes archived in file shares
Contains All .dtsx packages, .params, .conmgr files for the project — compressed and signed
Human-readable? No — binary ZIP; can be unzipped to access constituent XML files
Migration target Not directly — extract constituent .dtsx files for analysis

SA Tip: If a customer doesn't have Visual Studio source projects or Git history, .ispac files are the next best source for package extraction. Unzip the .ispac and you have all the package XML. This is often the fastest path to a complete package inventory when source control is absent.


.sln — Visual Studio Solution File

The .sln file is the top-level Visual Studio solution container — it groups one or more .dtproj SSIS projects together. It is a Visual Studio artifact, not an SSIS-specific format.

Property Detail
Created by Visual Studio when a solution is created
Stored in Source control root
Contains References to .dtproj project files and their relative paths
Human-readable? Yes — plain text
Migration target Organizational artifact only — maps to a Databricks workspace folder hierarchy

File Formats Quick Reference

Format Extension Migration Priority Human-Readable? Migration Target
SSIS Package .dtsx Critical Yes (XML) Databricks notebook / DLT pipeline
SSIS Project .dtproj High Yes (XML) Workspace folder / repo structure
Project Parameters .params High Yes (XML) Workflow job parameters / Databricks Secrets
Connection Manager .conmgr High Yes (XML) Databricks cluster config / Secrets
Deployment Package .ispac Medium No (ZIP) Source for package extraction if no Git
Visual Studio Solution .sln Low Yes Organizational reference only

10. Migration Assessment and Artifact Inventory

Step 1 — Establish the Package Inventory

The first deliverable in any SSIS migration assessment is a complete package inventory. Sources to query:

  1. SSISDB — deployed packages; query catalog.packages for the authoritative deployed list
  2. Git / SSDT projects — source packages; may differ from what's deployed
  3. File system / MSDB — legacy Package Deployment Model packages; check msdb.dbo.sysssispackages (msdb.dbo.sysdtspackages90 on SQL Server 2005)
  4. Build artifacts — .ispac files in CI/CD or file shares

Combine and deduplicate these sources to build your inventory. For each package, capture:

  • Package name and project
  • Last execution date (from catalog.executions)
  • Execution count in the last 90 days
  • Average and max runtime
  • Success/failure rate

Step 2 — Classify by Complexity

Score each package by migration complexity:

Complexity Factor Low (1 pt) Medium (2 pts) High (3 pts)
Data Flow tasks 1–2 3–5 6+
Script Tasks / Components None 1–2 3+
Source / destination types SQL Server only 2–3 distinct systems 4+ or includes mainframe/VSAM
Conditional logic (Conditional Splits, precedence expressions) Minimal Moderate Complex branching
Fuzzy Lookup / Fuzzy Grouping None Present Central to pipeline
Foreach Loop / dynamic file processing None Present Nested loops
Execute Package Task calls None 1–3 sub-packages 4+ sub-packages
Expressions with custom date/string logic None Simple Complex

Score interpretation:

Total Score Complexity Migration Approach
1–5 Simple Direct rewrite in notebook — 0.5–1 day
6–10 Medium Structured rewrite with testing — 2–5 days
11–15 Complex Requires architect involvement — 1–2 weeks
16+ High Full decomposition and redesign — 2–4 weeks
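
A sketch of the rubric as code, covering four of the mechanical factors — the counts are assumed to come from a .dtsx parse like the extraction sketches earlier:

def pts(n, med_lo, med_hi):
    """1 point below med_lo, 2 points from med_lo to med_hi, 3 above."""
    return 1 if n < med_lo else 2 if n <= med_hi else 3

def score_package(pkg):
    return (
        pts(pkg["data_flows"], 3, 5)      # 1-2 / 3-5 / 6+
        + pts(pkg["script_tasks"], 1, 2)  # none / 1-2 / 3+
        + pts(pkg["systems"], 2, 3)       # SQL Server only / 2-3 / 4+
        + pts(pkg["sub_packages"], 1, 3)  # none / 1-3 / 4+
    )

# Example: 4 data flows, no scripts, 2 systems, 1 sub-package -> 7 (medium)
print(score_package({"data_flows": 4, "script_tasks": 0, "systems": 2, "sub_packages": 1}))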

Step 3 — Identify Risk Areas

Script Tasks and Script Components — the highest migration risk. Each one contains arbitrary C# or VB.NET code. There is no automated translation path. Every Script Task/Component must be read, understood, and rewritten.

SA Tip: Ask the customer if their Script Tasks/Components have unit tests. The answer is almost always no. This means the only validation is behavioral testing against known data. Build regression test datasets before migration starts.

Stored Procedure Logic — SSIS packages that call Execute SQL Tasks with complex stored procedures push logic into SQL Server, not into the package. These stored procedures are often not inventoried as SSIS artifacts but are part of the ETL pipeline. They need to be inventoried and migrated too.

Foreach Loop over file directories — packages that loop over incoming file batches and process each file individually. These patterns map to Databricks Auto Loader but require understanding the file naming convention, trigger logic, and downstream sequencing.

Mixed encoding and flat files — especially Excel sources. These are fragile in SSIS and don't have direct equivalents; they require data ingestion redesign.

Cross-package parameters and shared connection managers — changing one shared connection manager can break many packages simultaneously. Map these dependencies before any infrastructure changes.

Complexity Distribution Benchmark

In a typical mid-size enterprise SSIS estate:

Complexity Expected % Notes
Never-run / decommissioned 20–40% Filter these out — no migration needed
Simple (score 1–5) 30–40% Quick wins for early migration waves
Medium (score 6–10) 20–30% Bulk of migration effort
Complex (score 11+) 5–15% Highest risk; plan carefully

11. Migration Mapping to Databricks

Building Blocks

SSIS Concept Databricks Equivalent
Package Databricks notebook / Python script / DLT pipeline
Project Databricks workspace folder / Git repository
Control Flow Databricks Workflow (task graph)
Data Flow Task PySpark DataFrame transformations / DLT pipeline
Execute Package Task Workflow task dependency (child workflow call)
SSISDB Unity Catalog (metadata) + Workflow run history (execution)
Environment (DEV/QA/PROD) Databricks environment-specific job configurations
Project Parameters Databricks Workflow job parameters / Databricks Secrets
Connection Manager Databricks external location / secret-backed connection
SSIS Catalog (deployment) Databricks Asset Bundles (DABs)

Components and Transforms

SSIS Component Databricks Equivalent
OLE DB Source / ADO.NET Source spark.read.jdbc() or native Delta/Parquet read
Flat File Source spark.read.csv() / spark.read.text()
Excel Source openpyxl in Python notebook (pre-process to CSV/Parquet)
Derived Column df.withColumn() / SELECT col + expression AS new_col
Data Conversion df.cast() / SQL CAST()
Conditional Split df.filter() + multiple output DataFrames / DLT expect_or_drop
Merge Join df.join() / SQL JOIN
Lookup df.join(..., how="left") + null filter / broadcast() join
Aggregate df.groupBy().agg() / SQL GROUP BY
Sort df.orderBy() — note: avoid in Spark unless necessary
Fuzzy Lookup / Fuzzy Grouping Splink library / custom cosine similarity / Databricks ML
Script Component (transform) Python UDF / pandas UDF registered in Unity Catalog
Row Count df.count() / DLT metrics
Multicast Multiple DataFrame references to same source DataFrame
OLE DB Destination df.write.jdbc() or Delta merge
Flat File Destination df.write.csv() / df.write.parquet()
Error Output routing DLT quarantine pattern / explicit filter + write to error table
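
Of these, the Lookup translation is worth spelling out, since the SSIS "no match → error output" behavior needs an explicit split. A sketch with illustrative table names:

from pyspark.sql.functions import broadcast, col

orders = spark.read.table("staging.orders")
dim = spark.read.table("dim.customers").select(
    col("customer_id").alias("cust_key")   # rename so the miss check is unambiguous
)

joined = orders.join(broadcast(dim), orders["customer_id"] == col("cust_key"), "left")

hits = joined.filter(col("cust_key").isNotNull()).drop("cust_key")    # Lookup match output
misses = joined.filter(col("cust_key").isNull()).drop("cust_key")     # Lookup error output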

Orchestration

SSIS / SQL Agent Concept Databricks Equivalent
SQL Server Agent Job Databricks Workflow
Agent Job Step Workflow Task
Agent Job Schedule Workflow Schedule (cron)
Step success/failure branching Workflow task on_success / on_failure dependencies
Agent Operator (email) Workflow notification / webhook
Job run history Workflow run history
dtexec.exe (command-line execution) Databricks CLI job run / REST API
Azure Data Factory trigger Databricks Workflow schedule / ADF linked notebook activity
SSIS Package parameter at runtime Workflow job parameter override

Governance and Configuration

SSIS Concept Databricks Equivalent
SSISDB environments (DEV/QA/PROD) Databricks workspace environments + DABs targets
Environment variable overrides Job parameters + Databricks Secrets
Sensitive parameters (encrypted) Databricks Secrets / Azure Key Vault
Package protection level (EncryptSensitiveWithPassword) Databricks Secrets — no password-based encryption in DABs
SSISDB execution log Unity Catalog system tables / Workflow run history
SSMS deployment wizard databricks bundle deploy (DABs)
Visual Studio project Git repository + DABs bundle definition

Data Quality

SSIS Pattern Databricks Equivalent
Conditional Split (null / range check) Delta Live Tables expect() / expect_or_drop()
Lookup with no-match → error output Left anti-join pattern in DLT
Fuzzy Lookup / Fuzzy Grouping Splink / custom ML similarity
Row Count + comparison DLT row count metric + expectation
Script Component (custom validation) Python UDF with validation logic
Error output → error log table DLT quarantine table pattern
Data Quality Services (DQS) Great Expectations / custom Python DQ framework

What Doesn't Map Cleanly

SSIS Feature Why It's Hard Recommended Approach
Script Tasks / Components (C# / VB.NET) No automated translation; arbitrary business logic Manual rewrite to Python; treat each as a custom development task
Excel source (ACE driver) 32-bit Windows driver; no equivalent in Spark Redesign: convert Excel to CSV or Parquet at ingest; use openpyxl for preprocessing
Fuzzy Lookup / Fuzzy Grouping Probabilistic matching with no direct Spark equivalent Splink library or custom similarity scoring; requires ML expertise
Package protection levels (password encryption) SSIS encryption tied to Windows identity / passwords Migrate secrets to Databricks Secrets / Azure Key Vault before migration
Real-time row-level Error Output routing SSIS error outputs are per-row inline; DLT quarantine is batch Redesign as DLT quarantine pattern or explicit filter-and-write
Foreach Loop with dynamic file discovery File-arrival-driven patterns require redesign Databricks Auto Loader for continuous file ingestion; event-based triggers via ADF
Linked Server calls inside Execute SQL Tasks Linked Servers are SQL Server-specific and won't exist in target Identify all linked server references; each is a separate source system requiring direct JDBC or connector
SSIS transactions (TransactionOption) SSIS supports distributed transactions across components Delta ACID transactions replace this; redesign pipeline boundaries around Delta merge operations
Windows Authentication (SSPI) to SQL Server Windows identity-based auth has no equivalent in Databricks Migrate to service principal / token authentication; update all connection managers

SA Tip: The combination of Script Components + Fuzzy matching + Excel sources in a single package is a reliable indicator of maximum migration complexity. If a customer has more than 10 packages with all three, flag it as a significant custom development engagement — not a lift-and-shift.