Informatica — SA Migration Guide¶
Purpose: Give a Solution Architect enough depth to assess an Informatica estate, understand its moving parts, and map a migration path to Databricks.
This is not a developer guide. You won't be building Informatica mappings. You will be walking customer sites, reviewing architecture diagrams, asking the right questions, and scoping what it takes to move to a modern lakehouse platform.
Architecture Diagrams¶
Informatica Platform Architecture¶
How the Informatica product suite fits together — from developer tooling through runtime execution to governance.
```mermaid
flowchart TD
subgraph DEV["Developer Layer"]
PC_CLIENT["PowerCenter Designer\nBuild mappings visually"]
IDMC_UI["IDMC Designer\n(Cloud) — browser-based mapping editor"]
WF_MGR["Workflow Manager\nDefine sessions, workflows, worklets"]
end
subgraph REPO["Repository Layer"]
REP_SVC["Repository Service\nCentral metadata store — mappings,\nsessions, workflows, connections"]
VERSION["Version Control\n(optional SVN/Perforce integration)"]
REP_SVC <--> VERSION
end
DEV -- check in / check out --> REP_SVC
subgraph RUNTIME["Runtime Layer"]
IS["Integration Service\n(DTM — Data Transformation Manager)\nExecutes sessions in parallel partitions"]
GRID["Informatica Grid\nScale-out across multiple IS nodes"]
IS --> GRID
end
REP_SVC -- deploys session configs --> RUNTIME
subgraph ORCH["Orchestration Layer"]
WF_SVC["Workflow Service\nExecutes workflows, chains sessions"]
MONITOR["Workflow Monitor\nReal-time job status and history UI"]
EXTERNAL["External Scheduler\nControl-M / Autosys / UC4"]
EXTERNAL -- triggers workflow --> WF_SVC
WF_SVC --> MONITOR
end
WF_SVC -- triggers sessions via --> RUNTIME
subgraph GOV["Governance & Quality"]
IDQ["Informatica Data Quality (IDQ)\nProfiling, rules, scorecards"]
MM["Metadata Manager\nLineage, impact analysis, catalog"]
MDM["MDM Hub\nMaster data management"]
end
RUNTIME -- dataset samples --> IDQ
REP_SVC -- metadata feed --> MM
subgraph CLOUD["Cloud (IDMC)"]
CAI["Cloud Application Integration"]
CDI["Cloud Data Integration\n(CDI) — replaces PowerCenter SaaS"]
CDGC["Cloud Data Governance & Catalog\n(CDGC) — replaces Metadata Manager"]
end
```
Informatica as ETL — Data Flow Between Systems¶
How Informatica sits between source systems and targets in a typical enterprise data pipeline.
```mermaid
flowchart LR
subgraph SOURCES["Source Systems"]
DB[("Relational DB\nOracle / SQL Server / DB2")]
MF["Mainframe / VSAM\nvia PowerExchange CDC"]
FILES["Flat Files\nDelimited / Fixed-width"]
SAAS["SaaS Apps\nSalesforce / SAP / Workday"]
end
subgraph INFA["Informatica ETL Layer"]
direction TB
SRC_QUAL["Source Qualifier\nSQL override, filter, join at source"]
TRANSFORMS["Transformation Logic\nExpression · Lookup · Joiner · Router\nAggregator · Sorter · Update Strategy"]
STAGING[("Staging / Cache\nLookup cache files\nSorter temp files")]
TGT["Target Definition\nRelational / File / Cloud"]
SRC_QUAL --> TRANSFORMS
TRANSFORMS -- cache/temp --> STAGING
STAGING --> TRANSFORMS
TRANSFORMS --> TGT
end
subgraph TARGETS["Target Systems"]
DW[("Data Warehouse\nTeradata / Netezza / Snowflake")]
DATALAKE["Data Lake\nS3 / ADLS / HDFS"]
RPTDB[("Reporting DB\nSQL Server / Oracle")]
DOWNSTREAM["Downstream Apps\nvia file or DB write"]
end
subgraph ORCH["Orchestration"]
WF["Workflow\nSessions chained with\ndependencies and decisions"]
end
SOURCES --> SRC_QUAL
TGT --> TARGETS
WF -. schedules .-> INFA
```
Sections¶
- Ecosystem Overview
- Mappings and Transformations — The Core Building Block
- Data Formats and Schema
- Parallelism and Session Partitioning
- Project Structure and Repository
- Orchestration: Workflows and Sessions
- Metadata, Lineage, and Impact Analysis
- Data Quality with IDQ
- Informatica File Formats Reference
- Migration Assessment and Artifact Inventory
- Migration Mapping to Databricks
1. Ecosystem Overview¶
What Is Informatica?¶
Informatica is the market-leading enterprise data integration platform. It has been the de facto standard for large-enterprise ETL since the late 1990s and is embedded in financial services, healthcare, retail, and government organizations globally. Customers choose it because it covers an enormous range of data integration scenarios — batch ETL, CDC, data quality, MDM, and API integration — from a single vendor.
Unlike cloud-native tools, Informatica PowerCenter is:
- On-premises first — designed for dedicated server deployments, increasingly run on cloud VMs or hybrid
- License-heavy and expensive — PowerCenter licenses are sold by core, connector pack, and product module; large customers pay millions annually
- Broad but complex — the product suite is vast, which means configuration sprawl and steep learning curves for new developers
- Transitioning to IDMC — Informatica is actively pushing customers from PowerCenter (on-prem) to its cloud SaaS platform, IDMC (Intelligent Data Management Cloud)
The Informatica Product Suite¶
Informatica is a suite of products. Knowing which ones a customer uses determines migration scope.
| Product | What It Does | Migration Relevance |
|---|---|---|
| PowerCenter | Core on-premises ETL engine — the primary migration target for most enterprises | High — all transformation logic, sessions, workflows live here |
| IDMC / CDI | Cloud SaaS replacement for PowerCenter — browser-based, serverless | Medium — customers partially migrated may have assets in both |
| PowerExchange | CDC and bulk-load connector for mainframe, SAP, databases | High — real-time/CDC pipelines require special migration handling |
| Informatica Data Quality (IDQ) | Profiling, cleansing rules, scorecards, address validation | Medium — quality rules embedded in pipelines must be migrated |
| Metadata Manager | Enterprise lineage, data catalog, impact analysis | High — primary tool for inventory and dependency mapping |
| MDM Hub | Master data management — golden record creation | Low-to-medium — usually out of scope for lakehouse migration |
| Axon / CDGC | Cloud data governance and catalog (successor to Metadata Manager) | Medium — lineage and catalog metadata relevant to migration planning |
| Data Replication / CDC | Real-time change data capture between systems | High if customer uses it — maps to Databricks DLT + CDC |
SA Tip: Many customers use PowerCenter for batch ETL and PowerExchange for CDC, but run them as separate efforts with separate teams. Ask explicitly whether there is a CDC workstream — it often has different owners and a longer migration timeline than batch pipelines.
Why Customers Want to Migrate¶
| Driver | What It Means for the Engagement |
|---|---|
| License cost | PowerCenter license + maintenance + connector packs represent one of the largest data platform line items |
| Vendor push to IDMC | Informatica is end-of-lifing PowerCenter support — some customers are migrating to Databricks instead of IDMC |
| Talent scarcity | Informatica developers are increasingly rare; customers want Python/SQL-native platforms |
| Cloud strategy | Board mandate to exit on-prem data centers |
| Performance limits | PowerCenter scales vertically or via grid — customers hitting ceiling on large volumes |
| Modern architecture | Customers want streaming, Delta Lake, and unified batch/streaming — capabilities PowerCenter lacks natively |
SA Tip: "Informatica is pushing us to IDMC and we don't want to pay for it twice" is one of the most common migration triggers in 2025–2026. Databricks becomes the alternative to IDMC — position it as the open, Python-native path rather than another proprietary SaaS lock-in.
Key Discovery Questions¶
Before scoping a migration, ask:
- How many mappings are in active production use? (vs. total mappings in the repository — there is always dead code)
- Is this PowerCenter, IDMC, or both? Which version of PowerCenter?
- What connector packs are in use? (Oracle, SAP, Salesforce, mainframe via PowerExchange?)
- Is PowerExchange / CDC in scope? Which sources?
- How is orchestration done — Informatica Workflows, external scheduler (Control-M, Autosys), or both?
- Are there reusable mapplets or shared Lookup caches that many mappings depend on?
- Is IDQ integrated into pipelines, or is data quality handled separately?
- What does the repository look like — one central repository or multiple folder-per-team structures?
- Are there custom transformations written in Java or C (Custom / External Procedure transformations)?
- What are the SLAs for critical pipelines and what are the batch window constraints?
2. Mappings and Transformations — The Core Building Block¶
The Mapping¶
In Informatica, the mapping is the fundamental unit of transformation logic — equivalent to a pipeline or ETL job. A mapping is a directed dataflow: data enters from one or more source definitions, passes through a series of transformation objects, and exits to one or more target definitions.
Developers build mappings visually in the PowerCenter Designer by dragging transformation objects onto a canvas and connecting them with links. Each link carries a row stream between transformations.
A single Informatica estate can have thousands of mappings — many of them legacy, redundant, or cloned from each other with minor differences.
Transformations¶
A transformation is a single processing step within a mapping. Informatica ships with a large library of built-in transformations, and customers can write custom ones. Transformations are the Informatica equivalent of Spark DataFrame operations.
Core transformation categories:
| Category | Transformation | What It Does |
|---|---|---|
| Source | Source Qualifier | Reads from source; SQL override for filtering/joining at DB level |
| Expression | Expression | Field-level calculations, conditionals, type conversions |
| Filter | Filter | Drops rows that don't meet a condition |
| Routing | Router | Splits rows into multiple output groups by condition |
| Join | Joiner | Joins two row streams (one must be a master, one detail) |
| Lookup | Lookup | Enriches rows by looking up values from a DB table or flat file |
| Aggregate | Aggregator | Group-by aggregations — sum, count, avg, max, min |
| Sort | Sorter | Orders rows; requires significant temp disk for large volumes |
| Deduplicate | n/a (done via Sorter + Expression logic) | No native dedup transform — customers implement manually |
| Update logic | Update Strategy | Marks rows as insert / update / delete / reject for target write |
| Sequence | Sequence Generator | Generates surrogate keys |
| Stored Proc | Stored Procedure | Calls a DB stored procedure inline |
| Custom | Custom / External Procedure | User-written C or Java logic — high migration risk |
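To make the mapping-to-code translation concrete, here is a minimal sketch of how a simple chain of the transformations above (Source Qualifier, Expression, Filter, Aggregator) typically lands in PySpark. Table and column names are hypothetical.

```python
from pyspark.sql import functions as F

# Source Qualifier: read with a filter pushed toward the source
orders = (
    spark.read.table("bronze.orders")                 # or spark.read.jdbc(...) for a DB source
    .where(F.col("order_date") >= "2024-01-01")
)

# Expression: field-level calculations and type conversions
orders = (
    orders
    .withColumn("net_amount", F.col("gross_amount") - F.col("discount"))
    .withColumn("order_year", F.year("order_date"))
)

# Filter: drop rows that don't meet a condition
orders = orders.where(F.col("net_amount") > 0)

# Aggregator: group-by aggregation
summary = orders.groupBy("customer_id", "order_year").agg(
    F.sum("net_amount").alias("total_net"),
    F.count("*").alias("order_count"),
)

summary.write.mode("overwrite").saveAsTable("silver.order_summary")
```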
Mapplets — Reusable Sub-Mappings¶
A mapplet is a reusable transformation subgraph — a group of transformations packaged as a single reusable object. Mapplets are the Informatica equivalent of a shared library or helper function.
SA Tip: Mapplets that are used across many mappings are high-priority migration dependencies — the Databricks equivalent (a shared PySpark function or Unity Catalog SQL function) must be built and validated before any downstream mappings can be migrated.
Sessions and Workflows¶
A mapping is pure transformation logic with no execution context. To run a mapping, you wrap it in a Session task, which provides runtime configuration: connections, parameter files, partitioning, error handling. Sessions are then chained inside Workflows — see Section 6.
3. Data Formats and Schema¶
How Informatica Defines Schema¶
Informatica uses source and target definitions stored in the repository to describe schema. Unlike Ab Initio's file-based DML, Informatica schema is stored relationally in the repository database and surfaced through the Designer UI.
Key schema objects:
| Object | What It Represents | Migration Relevance |
|---|---|---|
| Source Definition | Schema of the input — DB table, flat file, XML, WSDL | Must be mapped to a Databricks source (Delta table, file, API) |
| Target Definition | Schema of the output | Must be mapped to a Databricks target table or file |
| Transformation port | A column in a transformation's input or output | Maps to a DataFrame column in PySpark |
| Metadata Extension | Custom annotations on objects | Useful for tagging migration status |
Flat File Handling¶
Informatica handles flat files through the Flat File Definition — a schema descriptor stored in the repository that defines delimiter, row format, field order, and data types.
| File Type | Informatica Handling | Migration Note |
|---|---|---|
| Delimited (CSV, pipe) | Flat File Source/Target with delimiter config | Directly readable in Databricks with spark.read.csv |
| Fixed-width | Flat File Definition with field positions | Requires schema definition in PySpark — more effort |
| XML | XML Source Qualifier with XPath | Maps to spark.read.xml or custom parsing |
| JSON | Supported in newer versions / IDMC | Maps to spark.read.json |
| COBOL copybook | Via PowerExchange or flat file with copybook schema | Requires mainframe offload — high effort |
SA Tip: Fixed-width files with complex COBOL-style layouts are common in financial services Informatica estates. These require explicit schema definition in PySpark and are among the highest-effort individual sources to migrate. Count them during inventory.
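A minimal sketch of the fixed-width pattern, assuming a hypothetical three-field layout; the real field positions come from the customer's Flat File Definition or copybook.

```python
from pyspark.sql import functions as F

# Hypothetical fixed-width layout: field name -> (start position, length), 1-based
layout = {"account_id": (1, 10), "branch_code": (11, 4), "balance": (15, 12)}

raw = spark.read.text("/mnt/landing/accounts_fixed.dat")   # single column named "value"

parsed = raw.select([
    F.substring("value", start, length).alias(name)
    for name, (start, length) in layout.items()
])

# Trim and cast as the flat file definition dictates
parsed = parsed.select(
    F.trim("account_id").alias("account_id"),
    F.trim("branch_code").alias("branch_code"),
    F.trim("balance").cast("decimal(18,2)").alias("balance"),
)
```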
Parameter Files¶
Informatica uses parameter files (.par) to externalize runtime values — database connection strings, file paths, date ranges, environment flags. These are flat text files read at session startup.
Parameters are used at three scopes:
| Scope | Syntax | What It Controls |
|---|---|---|
| Mapping parameter | $$PARAM_NAME | Values used inside transformation expressions |
| Session parameter | $DBConnection, $InputFile | Connection objects, file paths |
| Workflow variable | $$WF_VAR | Values passed between tasks in a workflow |
Migration relevance: Parameter files are the Informatica equivalent of configuration — they must be translated to Databricks job parameters, Databricks Secrets, or environment-specific configs in Databricks Asset Bundles. Customers often have dozens of environment-specific parameter files (dev, QA, prod) that must all be accounted for.
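As an illustration, here is a sketch of how values a .par file used to supply might surface inside a Databricks notebook task; the widget names, secret scope, and path are hypothetical.

```python
# Job parameters replace $$ mapping parameters and $ session parameters
dbutils.widgets.text("start_date", "2024-01-01")
dbutils.widgets.text("end_date", "2024-01-31")
dbutils.widgets.text("input_path", "/mnt/inbound/customers.csv")

start_date = dbutils.widgets.get("start_date")
end_date = dbutils.widgets.get("end_date")

# Credentials that lived in connection overrides move to Databricks Secrets
jdbc_password = dbutils.secrets.get(scope="etl-prod", key="oracle_source_pwd")

df = (
    spark.read.option("header", "true").csv(dbutils.widgets.get("input_path"))
    .where(f"order_date BETWEEN '{start_date}' AND '{end_date}'")
)
```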
4. Parallelism and Session Partitioning¶
How Informatica Achieves Parallelism¶
Informatica's runtime engine — the DTM (Data Transformation Manager) — executes mappings as sessions. Parallelism is achieved through session partitioning: the DTM splits input data into partitions and processes them concurrently across threads (or across nodes on an Informatica Grid).
Unlike Spark, partitioning in Informatica is manually configured per session — the developer or DBA must specify partition type, partition count, and partition key for each session.
Partition Types¶
| Partition Type | What It Does | Databricks Equivalent |
|---|---|---|
| Pass-through | Rows flow as-is, no redistribution | No shuffle — default Spark behavior |
| Round-robin | Distributes rows evenly across partitions | repartition(n) |
| Hash | Routes rows with matching key to same partition | repartition(col) |
| Key range | Splits rows by value range | Range partitioning |
| Database partitioning | Uses DB-native partitioning (Oracle partition pruning) | Spark partition pushdown |
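A short sketch of how the first four partition types translate to DataFrame calls; df is a hypothetical input DataFrame.

```python
df = spark.read.table("bronze.transactions")   # hypothetical input

# Pass-through: no redistribution, keep Spark's existing partitioning

# Round-robin: distribute rows evenly across a fixed number of partitions
df_even = df.repartition(32)

# Hash: co-locate rows that share a key (like Informatica hash partitioning)
df_by_key = df.repartition("customer_id")

# Key range: range partitioning, useful before a sorted write
df_ranged = df.repartitionByRange(32, "order_date")
```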
Informatica Grid¶
The Informatica Grid distributes session processing across multiple Integration Service nodes. Each node handles a subset of partitions. Grid is the scale-out mechanism for very large volumes.
Migration relevance: Customers running Grid configurations are processing at significant scale. When migrating to Databricks, the cluster size and auto-scaling configuration must be sized to match or exceed the throughput of the Grid setup. Ask for session statistics (rows/sec, peak throughput) from the Workflow Monitor to establish a baseline.
Lookup Caching¶
The Lookup transformation is one of Informatica's most-used and most-expensive transforms. It can operate in two modes:
| Mode | Behavior | Migration Note |
|---|---|---|
| Cached lookup | Reads the entire lookup table into memory at session start | Maps to Spark broadcast join — must check lookup table size |
| Uncached lookup | Issues a DB query per row — extremely slow at scale | Should be rewritten as a join in Databricks, not replicated as-is |
| Dynamic cache | Cache is updated as new records are inserted | Complex — maps to Delta merge pattern with stateful logic |
SA Tip: Uncached lookups in production are a performance antipattern in Informatica — but they exist in nearly every large estate. When you find them, don't replicate the pattern; flag them as optimization opportunities in the migration. Rewriting as joins on Databricks almost always yields a 10–100x speedup.
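A sketch of the rewrite, assuming a hypothetical orders DataFrame and a reference table small enough to broadcast:

```python
from pyspark.sql import functions as F

orders = spark.read.table("silver.orders")                # hypothetical fact data
dim_product = spark.read.table("ref.dim_product")         # hypothetical reference table

# Cached lookup -> broadcast join (check that the reference table fits in memory)
enriched = orders.join(
    F.broadcast(dim_product.select("product_id", "product_name", "category")),
    on="product_id",
    how="left",    # unmatched rows keep nulls, like a lookup returning no match
)

# Uncached (row-by-row) lookup: do NOT replicate; the same left join replaces it
```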
5. Project Structure and Repository¶
The Repository¶
The PowerCenter Repository is a relational database (Oracle, SQL Server, or DB2) that stores all metadata: source/target definitions, mappings, mapplets, sessions, workflows, connections, and version history. It is the authoritative inventory of the Informatica estate.
Access the repository through the Repository Service — a managed service that brokers all client connections. Direct database queries to the repository are possible (and useful for inventory scripting) but officially unsupported by Informatica.
Folder Structure¶
Within the repository, artifacts are organized into folders — the primary organizational unit. Teams typically own one or more folders.
```
Repository: PROD_REPO
├── Folder: Finance_ETL
│ ├── Sources
│ ├── Targets
│ ├── Transformations (including Mapplets)
│ ├── Mappings
│ └── Workflows / Sessions
├── Folder: HR_Integration
└── Folder: Shared_Objects ← cross-team reusable objects
```
| Concept | What It Is | Databricks Equivalent |
|---|---|---|
| Repository | Central metadata store for all artifacts | Databricks workspace (Unity Catalog as metadata store) |
| Folder | Team or domain grouping of all related artifacts | Databricks workspace folder / catalog schema |
| Shared folder | Common reusable objects (Lookups, Mapplets) referenced cross-folder | Unity Catalog shared schema |
| Label | Version tag applied to a folder snapshot | Git tag |
| Deployment group | A set of objects exported together for promotion | CI/CD artifact bundle |
Version Control and Promotion¶
PowerCenter has a built-in version control capability (check-in/check-out of objects with history). However, it does not integrate natively with Git — version history lives in the repository, not in source code.
Promotion across environments (DEV → QA → PROD) is done via:
- Export/Import — objects exported as XML (.xml export files) and imported into the target environment repository
- Deployment groups — a named set of objects that can be promoted together
SA Tip: Many customers have informal or ad-hoc promotion processes — "export this mapping, email it to the QA team." Ask specifically how promotion is done and who controls it. If the answer is not a repeatable automated process, migration to Databricks Asset Bundles with proper CI/CD is an immediate win to highlight.
6. Orchestration: Workflows and Sessions¶
The Orchestration Model¶
Informatica orchestration has three layers:
| Layer | Object | What It Does |
|---|---|---|
| Workflow | Workflow | Top-level container — chains tasks (sessions, decisions, emails, commands) with dependencies |
| Worklet | Worklet | Reusable sub-workflow — a group of tasks packaged for reuse across workflows |
| Session | Session task | Executes a single mapping with runtime configuration |
A Workflow is the Informatica equivalent of a Databricks Workflow or an Airflow DAG.
Task Types in Workflows¶
| Task Type | What It Does | Databricks Equivalent |
|---|---|---|
| Session | Runs a mapping (the main work unit) | Databricks Workflow notebook/JAR task |
| Command | Runs a shell command or script | Databricks Workflow notebook task or shell script task |
| Decision | Evaluates a condition and branches | Conditional task dependency in Workflow |
| Email | Sends an email notification | Databricks Workflow alert / webhook |
| Event Wait | Pauses until a file or event arrives | Databricks file-arrival trigger |
| Timer | Waits a fixed time interval | Databricks schedule with offset |
| Assignment | Sets a workflow variable | Passing parameters between Workflow tasks |
| Worklet | Embeds a reusable sub-workflow | Databricks Workflow with nested task groups |
Session Configuration¶
A session wraps a mapping with:
- Connection objects — named database or file connections (connection details externalized from mapping logic)
- Parameter file — runtime values for this execution
- Partition configuration — how many partitions, which partition type, which key
- Error handling — stop-on-error thresholds, bad file output path
- Pre/post session SQL — SQL commands run before/after the session (often used for truncate, index rebuild, stats update)
SA Tip: Pre/post session SQL is frequently where critical logic hides — table truncates, index drops, statistics updates, audit inserts. These must be inventoried as part of session migration, not just the mapping logic. A session that "just loads data" often has five SQL commands doing equally important work around it.
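A sketch of how pre/post session SQL typically carries over as cells around the main load in a notebook task; the table names and statements are hypothetical stand-ins.

```python
# Pre-session SQL: commands that ran before the Informatica session
spark.sql("TRUNCATE TABLE stage.daily_orders")            # hypothetical staging table

# Main load: the migrated mapping logic
df = spark.read.table("bronze.orders_raw")
df.write.mode("append").saveAsTable("stage.daily_orders")

# Post-session SQL: audit insert and statistics refresh that ran after the session
spark.sql("""
    INSERT INTO audit.load_log
    SELECT 'daily_orders', current_timestamp(), count(*) FROM stage.daily_orders
""")
spark.sql("ANALYZE TABLE stage.daily_orders COMPUTE STATISTICS")
```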
External Schedulers¶
Many Informatica environments delegate top-level scheduling to an enterprise scheduler — BMC Control-M, Tidal, Autosys, or IBM Workload Scheduler. In this pattern:
- The external scheduler handles time-based triggers and cross-system dependencies (e.g., "wait for the SAP extract file to land in the FTP folder")
- Informatica Workflows handle intra-pipeline sequencing
Migration relevance: If an external scheduler is in scope, the migration involves two integration points: replacing Informatica Workflows with Databricks Workflows, and integrating Databricks with the external scheduler (or replacing it with Databricks' built-in scheduling). Confirm with the customer which schedulers are in play — they may have different owners and different migration timelines.
7. Metadata, Lineage, and Impact Analysis¶
The Repository as a Metadata Store¶
The PowerCenter repository is the richest metadata source available during assessment. Because it stores all mapping logic relationally, it can be queried directly via SQL to produce inventory reports. Informatica also ships a set of repository views (tables prefixed with REP_) that expose metadata in a structured form.
Key repository views for migration inventory:
| View | What It Contains | Use During Migration |
|---|---|---|
| REP_ALL_MAPPINGS | All mappings with folder, name, description, last-modified | Top-level migration scope inventory |
| REP_ALL_TRANSFORMS | All transformation instances and their types | Identify complex transforms and custom objects |
| REP_SESS_LOG | Session run history with row counts and durations | Identify active pipelines, volume sizing |
| REP_WORKFLOWS | All workflows and their status | Orchestration scope inventory |
| REP_SUBJECT_AREA | Folder definitions | Organize inventory by team/domain |
| REP_SRC_FILES | Source file definitions | Identify flat file sources |
| REP_TGT_FILES | Target file definitions | Identify flat file targets |
SA Tip: Query REP_SESS_LOG filtered to the last 90 days to distinguish active mappings from the full repository. Many Informatica estates have 50% or more legacy mappings that haven't run in years. Migrating only active pipelines dramatically reduces scope.
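If read-only JDBC access to the repository database is available, a query along these lines can be run from a Databricks notebook. The connection details are placeholders, and the view and column names shown should be verified against the customer's PowerCenter version before trusting the output.

```python
# Pull recent session run history from the repository DB over JDBC (read-only).
# View/column names are illustrative; confirm them against the customer's version.
sess_runs = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@repo-host:1521/INFAREP")
    .option("user", "repo_readonly")
    .option("password", dbutils.secrets.get("assessment", "repo_pwd"))
    .option("query", """
        SELECT subject_area, workflow_name, session_name,
               actual_start, successful_rows, failed_rows
        FROM   rep_sess_log
        WHERE  actual_start >= ADD_MONTHS(SYSDATE, -3)
    """)
    .load()
)

# Active sessions in the last 90 days, ordered by volume
(
    sess_runs.groupBy("subject_area", "session_name")
    .agg({"successful_rows": "sum"})
    .orderBy("sum(successful_rows)", ascending=False)
    .display()
)
```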
Metadata Manager¶
Informatica Metadata Manager is a separate product that sits above the repository. It scans metadata from multiple sources (PowerCenter, databases, BI tools, Hadoop) and builds an enterprise lineage graph.
| Metadata Manager Feature | Migration Use |
|---|---|
| Cross-system lineage | Trace a field from source DB → Informatica mapping → target DW → BI report |
| Impact analysis | "What breaks if I change this source table column?" |
| Technical lineage | Field-level lineage within PowerCenter mappings |
| Business glossary | Link technical assets to business terms |
SA Tip: Metadata Manager is often licensed but underused. Ask whether it's active and whether lineage has been scanned recently. If it is current, export the lineage graph — it is the fastest way to identify cross-system dependencies and scope the migration boundary.
Field-Level Lineage¶
Within a mapping, field-level lineage can be traced manually or via repository queries: which source field, through which transformation expression, produces which target field. This is essential for validating that migrated pipelines produce identical output.
Common lineage gaps:
- Mappings that call Command tasks (shell scripts) — lineage breaks at the script boundary
- Pre/post session SQL — not captured in mapping lineage at all
- Dynamic connections where source/target is determined at runtime from parameter values
- Mappings that write to intermediate files read by other sessions (file-based handoffs between sessions within a workflow)
8. Data Quality with IDQ¶
What Informatica Data Quality Does¶
Informatica Data Quality (IDQ) is a separate product suite for profiling, cleansing, standardization, and matching. In enterprises that use it, IDQ rules are often embedded directly inside PowerCenter mappings via IDQ transformations, making data quality part of the ETL pipeline rather than a separate system.
IDQ Components in Mappings¶
| IDQ Component | What It Does | Databricks Equivalent |
|---|---|---|
| Data Processor | Parses and transforms complex formats (XML, SWIFT, EDI, COBOL) | Custom parsing logic / Databricks from_xml / external library |
| Parser | Breaks unstructured text into structured fields | Custom NLP/regex UDF |
| Standardizer | Normalizes values to canonical form (address, name) | Custom lookup/regex expression or Databricks partner DQ tool |
| Matcher | Fuzzy matching for deduplication | Databricks ML-based entity resolution or partner tool |
| Labeler | Classifies records by pattern (email, phone, SSN) | Custom regex classifier / Unity Catalog data classification |
| Exception | Routes bad records to an exceptions queue | Delta Live Tables expect() with quarantine |
| Scorecards | Aggregate DQ metrics across pipelines | Databricks DQ dashboards / Lakehouse Monitoring |
SA Tip: IDQ embedded in production mappings is business logic, not optional decoration. If a customer has IDQ transformations in their critical path, ask for the ruleset definitions — they must be migrated to Databricks (as DLT expectations, Great Expectations, or Soda) before the pipeline is considered equivalent. This is commonly underestimated in migration scoping.
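A minimal sketch of the target pattern inside a DLT pipeline, expressing two hypothetical rules as expectations with a companion quarantine table; rule logic and table names are placeholders for the customer's IDQ ruleset.

```python
import dlt
from pyspark.sql import functions as F

rules = {
    "has_email_shape": "email LIKE '%@%.%'",
    "non_null_customer": "customer_id IS NOT NULL",
}

@dlt.table(name="customers_clean")
@dlt.expect_all_or_drop(rules)                 # rows failing any rule are dropped here
def customers_clean():
    return dlt.read("customers_raw")           # hypothetical upstream table

@dlt.table(name="customers_quarantine")        # ...and captured here for review
def customers_quarantine():
    failed = " OR ".join(f"NOT ({r})" for r in rules.values())
    return dlt.read("customers_raw").where(F.expr(failed))
```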
Profiling¶
IDQ includes a Profiling workbench that analyzes datasets and generates statistics — null rates, value distributions, pattern frequencies, PK/FK violations. Use existing profiling outputs to:
- Establish a data quality baseline before migration begins
- Define acceptance criteria for post-migration validation
- Identify columns with high null rates or format inconsistencies that may cause issues during migration
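If existing profiling outputs are not available, a rough baseline can be computed directly in PySpark; the table and column below are hypothetical.

```python
from pyspark.sql import functions as F

df = spark.read.table("bronze.customers")      # hypothetical table to baseline

# Null rate per column: a rough stand-in for an IDQ null-rate profile
total = df.count()
null_rates = df.select([
    (F.count(F.when(F.col(c).isNull(), c)) / F.lit(total)).alias(c)
    for c in df.columns
])
null_rates.show(truncate=False)

# Value distribution for a single column
df.groupBy("country_code").count().orderBy("count", ascending=False).show(20)
```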
9. Informatica File Formats Reference¶
When you walk into a customer's Informatica environment, you will encounter a specific set of file types. Knowing what each file is and what it means for migration is essential for artifact inventory.
.xml — Repository Export File¶
The .xml export file is the primary portability format for Informatica artifacts. When objects are exported from the repository (for backup, promotion, or migration), they are serialized as XML. This is the format used to move objects between repositories and environments.
| Property | Detail |
|---|---|
| Created by | Informatica Repository Manager — developers export objects manually or via pmrep CLI |
| Stored in | File system (not in repository) — typically on a shared drive or version control system |
| Contains | Complete definitions of mappings, sessions, workflows, source/target definitions, connections — everything needed to reconstruct objects in another repository |
| Human-readable? | Yes — XML text, but verbose and not intended for hand-editing |
| Migration target | The .xml export is the starting point for automated parsing during migration inventory — tools can extract mapping logic, transformation lists, and lineage from these files without needing live repository access |
SA Tip: Ask the customer to export all production folder XML files as a first deliverable. These files give you a complete offline inventory of the estate without needing direct repository database access — and they can be parsed programmatically to count transformation types, identify custom objects, and scope effort.
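A sketch of that offline parsing, assuming the exports follow the usual PowerCenter XML layout (MAPPING elements containing TRANSFORMATION elements with NAME and TYPE attributes); spot-check one file against this assumption before trusting the totals. The export path is hypothetical.

```python
import glob
import xml.etree.ElementTree as ET
from collections import Counter

type_counts, mapping_count = Counter(), 0

for path in glob.glob("/dbfs/mnt/assessment/exports/*.xml"):
    root = ET.parse(path).getroot()
    for mapping in root.iter("MAPPING"):
        mapping_count += 1
        for transform in mapping.iter("TRANSFORMATION"):
            type_counts[transform.get("TYPE", "UNKNOWN")] += 1

print(f"Mappings found: {mapping_count}")
for ttype, n in type_counts.most_common():
    print(f"{ttype:40s} {n}")
```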
.par — Parameter File¶
The parameter file is a flat text file that externalizes runtime configuration values. Sessions read these files at startup to resolve $$PARAMETER and $SessionVariable references.
| Property | Detail |
|---|---|
| Created by | Developers or operations teams — hand-authored text files |
| Stored in | File system, typically in an environment-specific directory (e.g., /infa/params/prod/) |
| Contains | Name=value pairs for mapping parameters, session parameters, workflow variables, and connection overrides |
| Human-readable? | Yes — plain text key-value format |
| Migration target | Maps to Databricks job parameters, Databricks Secrets (for credentials), or environment-specific config files in Databricks Asset Bundles |
Example parameter file:
```
[session_name.mapping_name]
$$START_DATE=2024-01-01
$$END_DATE=2024-01-31
$DBConnection_Source=Oracle_Prod
$InputFile=/data/inbound/customers_20240101.csv
```
SA Tip: Count how many unique parameter files exist across environments and how they differ. Customers with a disciplined parameter file structure (one file per environment, clearly named) migrate cleanly to DAB configs. Customers with ad-hoc parameter files scattered across server directories are a significant configuration management risk.
.wf / Workflow and Session Definitions (in Repository)¶
Workflows and sessions are stored in the repository (not as standalone files) but are exported as part of the .xml export. They can also be inspected via the pmrep command-line interface.
| Property | Detail |
|---|---|
| Created by | Workflow Manager — developers configure sessions and wire them into workflows visually |
| Stored in | Repository database (exported as part of .xml) |
| Contains | Task definitions (sessions, commands, decisions), task dependencies, event triggers, scheduling, partition settings per session |
| Human-readable? | Via .xml export — yes, but verbose |
| Migration target | Each workflow maps to a Databricks Workflow; each session maps to a Databricks task (notebook or DLT pipeline) |
SA Tip: Workflows with many branching Decision tasks and Assignment tasks are complex orchestration logic — not just "run these sessions in order." These must be carefully mapped to Databricks Workflow conditional dependencies. A workflow with 50+ tasks is a significant migration effort unit on its own.
.bad — Bad Record File¶
The bad record file captures rows that fail a session's error threshold — records rejected at the target write stage. It is a flat file written by the DTM during session execution.
| Property | Detail |
|---|---|
| Created by | Integration Service — written automatically during session execution |
| Stored in | File system, path configured in session properties |
| Contains | Rejected rows in original format plus an indicator code |
| Human-readable? | Yes — same delimited format as the input |
| Migration target | Maps to a DLT quarantine table or a Delta "bad records" table with rejection reason column |
Lookup Cache Files (.idx / .dat)¶
When a Lookup transformation runs in cached mode, it writes the lookup table to disk as cache files (an index file .idx and a data file .dat) in the cache directory.
| Property | Detail |
|---|---|
| Created by | Integration Service at session runtime |
| Stored in | Cache directory on the Integration Service server (configured per session) |
| Contains | In-memory snapshot of the lookup table, serialized to disk for reuse across sessions |
| Human-readable? | No — binary format |
| Migration target | No direct migration — in Databricks, lookups become broadcast joins or Delta table lookups. Shared/persistent caches map to Delta tables cached in memory via cache() or DBIO cache |
SA Tip: Customers with large shared lookup caches (gigabytes of reference data loaded into memory) are tuning around Informatica's limitation that each session must load its own lookup copy. In Databricks, a single broadcast join or Delta cache serves all parallel tasks — this is an automatic performance improvement, not an additional engineering task.
Session Log and Workflow Log¶
Informatica writes detailed run logs for every session and workflow execution to the file system.
| Property | Detail |
|---|---|
| Created by | Integration Service / Workflow Service at runtime |
| Stored in | Log directory on the server, configurable path |
| Contains | Row counts (sourced, inserted, updated, rejected, deleted), timestamps, SQL statements, error messages |
| Human-readable? | Yes — plain text with timestamps |
| Migration target | Not migrated — but use historical logs during assessment to identify run frequency, volumes, and SLA compliance of each pipeline |
Quick Reference — Informatica File/Artifact Types¶
| Artifact | Format | Human-Readable | Migration Target |
|---|---|---|---|
| Repository export | .xml | Yes (verbose) | Starting point for offline inventory and parsing |
| Parameter file | .par | Yes (key-value) | Databricks job params / Secrets / DAB config |
| Workflow + Session | .xml (exported) | Yes | Databricks Workflow |
| Bad record file | .bad | Yes (delimited) | DLT quarantine table / Delta bad records table |
| Lookup cache index | .idx | No (binary) | Not migrated — replaced by broadcast join |
| Lookup cache data | .dat | No (binary) | Not migrated — replaced by broadcast join |
| Session log | .log | Yes | Assessment only — not migrated |
10. Migration Assessment and Artifact Inventory¶
How to Build the Estate Inventory¶
The richest inventory source is the PowerCenter repository. If you have read-only access to the repository database, run SQL queries against the REP_* views to extract:
- Mapping count by folder — total scope, organized by team/domain
- Last run date per session — filter to 90-day active window to identify live pipelines
- Transformation type distribution — count of each transformation type across all mappings (identifies custom transform exposure)
- Session run statistics — rows processed, duration, partition count — for volume sizing
- Workflow task counts — number of sessions per workflow (identifies orchestration complexity)
If repository access is not available, request an XML export of all production folders — this gives you everything needed for offline parsing.
Complexity Scoring Model¶
Score each mapping for migration effort using these factors:
| Factor | Low (1 pt) | Medium (2 pts) | High (3 pts) |
|---|---|---|---|
| Transformation count | < 10 | 10–30 | > 30 |
| Custom transformations | None | 1–2 Java/C transforms | 3+ custom transforms |
| Lookup count | 0–2 | 3–6 | 7+ |
| Uncached lookups | None | 1–2 | 3+ |
| IDQ components | None | 1–2 rules | Complex ruleset |
| Source types | DB table / delimited file | Fixed-width / XML | COBOL / PowerExchange / SAP |
| Pre/post session SQL | None | Simple truncate/insert | Complex procedural SQL |
| Parameter complexity | Simple date/path params | Multi-scope params | Dynamic connection switching |
| Workflow complexity | Linear session chain | Decision branches | Complex worklet nesting |
Scoring bands:
| Total Score | Classification | Typical Migration Approach |
|---|---|---|
| 1–5 | Simple | Direct translation to PySpark/SQL notebook |
| 6–12 | Moderate | PySpark notebook with careful port-by-port validation |
| 13–20 | Complex | DLT pipeline with incremental migration and parallel run |
| 21+ | High | Architect-led redesign — likely a rewrite, not a lift-and-shift |
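The scoring model reduces to a small function. A sketch, with a hypothetical mapping scored against the factor table above:

```python
# Minimal sketch of the scoring model: each factor contributes 1, 2, or 3 points.
def classify(scores: dict[str, int]) -> tuple[int, str]:
    total = sum(scores.values())
    if total <= 5:
        band = "Simple"
    elif total <= 12:
        band = "Moderate"
    elif total <= 20:
        band = "Complex"
    else:
        band = "High"
    return total, band

# Example: ~15 transforms, 4 lookups (1 uncached), one IDQ rule, fixed-width source
example = {
    "transformation_count": 2, "custom_transforms": 1, "lookup_count": 2,
    "uncached_lookups": 2, "idq_components": 1, "source_types": 2,
    "pre_post_sql": 1, "parameter_complexity": 1, "workflow_complexity": 1,
}
print(classify(example))   # -> (13, 'Complex')
```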
Risk Areas Specific to Informatica¶
| Risk Area | What to Look For | Mitigation |
|---|---|---|
| Custom Java/C transformations | Custom Transformation or External Procedure objects in mappings | Requires porting logic to PySpark UDFs — must find/review source code |
| PowerExchange CDC | Real-time change capture from mainframe, Oracle, DB2 | Different migration path — maps to Databricks DLT + CDC connectors |
| Dynamic connections | Session parameters override source/target connections at runtime | Complex config management — must be translated to Databricks multi-environment patterns |
| Stored procedure calls | Inline Stored Procedure transformations or heavy pre/post SQL | DB-side logic must be ported or DB retained as dependency |
| Large Sorter usage | Sorter transformations on multi-billion-row datasets | Sorter uses temp disk in Informatica — verify Databricks cluster has enough shuffle space |
| IDQ embedded in critical pipelines | Data Processor, Standardizer, Matcher in production mappings | Must build equivalent DQ layer before pipeline is considered migrated |
| Shared Lookup across sessions | Persistent caches shared across multiple sessions | Translate to Delta lookup table with appropriate caching strategy |
| File-based session handoffs | Session A writes a file; Session B reads it within same workflow | Intermediate files must become Delta tables in the Databricks version |
11. Migration Mapping to Databricks¶
Building Blocks¶
| Informatica Concept | Databricks Equivalent |
|---|---|
| Mapping | Databricks notebook (PySpark / SQL) or DLT pipeline |
| Mapplet | Shared PySpark function / Unity Catalog SQL function |
| Session | Databricks Workflow task (wraps a notebook or DLT pipeline) |
| Workflow | Databricks Workflow |
| Worklet | Databricks Workflow task cluster / reusable task group |
| Repository | Databricks workspace + Unity Catalog (metadata) + Git (code) |
| Folder | Databricks workspace folder / Unity Catalog schema |
| Parameter file | Databricks job parameters + Databricks Secrets + DAB config |
| Integration Service (DTM) | Databricks cluster (driver + workers) |
| Informatica Grid | Databricks cluster auto-scaling |
Transformations to Databricks¶
| Informatica Transformation | Databricks Equivalent |
|---|---|
| Source Qualifier | spark.read.jdbc() / spark.read.format() with pushdown filters |
| Expression | withColumn() / SQL SELECT expressions |
| Filter | filter() / WHERE clause |
| Router | filter() per branch / CASE WHEN with multiple writes |
| Joiner | join() / SQL JOIN |
| Lookup (cached) | Broadcast join (broadcast() hint) / Delta table join |
| Lookup (uncached) | Rewrite as join — do not replicate row-by-row lookup pattern |
| Aggregator | groupBy().agg() / SQL GROUP BY |
| Sorter | orderBy() / sort() — use only where required (output ordering) |
| Update Strategy | Delta Lake MERGE INTO (upsert) |
| Sequence Generator | monotonically_increasing_id() or Delta sequence table |
| Stored Procedure | PySpark JDBC call or migrate stored proc logic to Spark |
| Custom / External Procedure | PySpark UDF (Python or Scala) — requires source code review |
| Data Processor (IDQ) | Custom parsing + from_xml() / external library |
| Standardizer (IDQ) | Custom UDF / Databricks partner DQ tool |
| Matcher (IDQ) | Databricks ML entity resolution / partner tool |
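One translation worth showing end to end is Update Strategy, since it collapses into a single Delta MERGE. A sketch using the Delta Lake Python API, with hypothetical table names and an op_code column standing in for the DD_INSERT / DD_UPDATE / DD_DELETE flags:

```python
from delta.tables import DeltaTable

updates = spark.read.table("stage.customer_updates")      # hypothetical change set
target = DeltaTable.forName(spark, "silver.customers")

# Update Strategy rows marked insert/update/delete become one MERGE statement
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op_code = 'D'")
    .whenMatchedUpdateAll(condition="s.op_code = 'U'")
    .whenNotMatchedInsertAll(condition="s.op_code = 'I'")
    .execute()
)
```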
Orchestration¶
| Informatica Concept | Databricks Equivalent |
|---|---|
| Workflow | Databricks Workflow |
| Session task | Databricks Workflow notebook task / DLT pipeline task |
| Command task | Databricks Workflow notebook task (Python script) |
| Decision task | Conditional task dependency (on_success / on_failure branches) |
| Worklet | Reusable Databricks Workflow task cluster or modular DAG |
| Event Wait (file arrival) | Databricks file-arrival trigger |
| External scheduler trigger | Databricks REST API job trigger from Control-M/Autosys |
| Scheduling | Databricks Workflow schedule (cron or continuous) |
| Parameter passing | Databricks Workflow job parameters + dbutils.widgets |
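Workflows can also be created programmatically. A sketch using the Databricks Python SDK (databricks-sdk), with two chained tasks standing in for an Informatica workflow of two sessions; notebook paths are hypothetical, compute configuration is omitted for brevity, and field names may vary slightly by SDK version.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()   # authenticates from environment / .databrickscfg

# Two chained tasks stand in for an Informatica workflow of two sessions
job = w.jobs.create(
    name="finance_daily_load",
    tasks=[
        jobs.Task(
            task_key="load_orders",
            notebook_task=jobs.NotebookTask(notebook_path="/ETL/load_orders"),
        ),
        jobs.Task(
            task_key="build_summary",
            depends_on=[jobs.TaskDependency(task_key="load_orders")],
            notebook_task=jobs.NotebookTask(notebook_path="/ETL/build_summary"),
        ),
    ],
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 2 * * ?", timezone_id="UTC"),
)
print(job.job_id)
```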
Governance and Metadata¶
| Informatica Concept | Databricks Equivalent |
|---|---|
| Repository | Unity Catalog (metadata) + Git repo (code) |
| Metadata Manager lineage | Unity Catalog automated lineage |
| Source/Target definitions | Unity Catalog table definitions |
| Metadata extensions (tags) | Unity Catalog tags |
| Version control (check-in/out) | Git + Databricks Repos / Workspace Git integration |
| Deployment group promotion | Databricks Asset Bundles (DAB) with CI/CD pipeline |
| Folder security | Unity Catalog schema-level permissions |
Data Quality¶
| Informatica IDQ Concept | Databricks Equivalent |
|---|---|
| Profile / scorecard | Databricks Lakehouse Monitoring |
| Validate / Exception | Delta Live Tables expect() with quarantine table |
| Standardizer | PySpark UDF / Databricks partner (Soda, Great Expectations) |
| Matcher | Databricks ML entity resolution |
| Bad record file (.bad) | DLT quarantine Delta table |
| DQ rules embedded in mapping | DLT expectations co-located with transformation logic |
What Doesn't Map Cleanly¶
| Informatica Capability | Why It's Hard | Recommended Approach |
|---|---|---|
| Uncached Lookup (row-by-row) | Anti-pattern that hides in thousands of mappings; replicating it in Spark would be equally slow | Rewrite as a broadcast join or Delta table lookup during migration |
| PowerExchange real-time CDC | Fundamentally different architecture — reads DB transaction logs at low level | Use Databricks DLT + Debezium/Qlik/Fivetran CDC connectors as the new source layer |
| Custom C transformations | Compiled binary — no source visibility; logic not in the repository | Locate C source files, port to PySpark UDF — may require original developer |
| Dynamic connection switching | Session-level parameter overrides connection objects at runtime | Translate to Databricks multi-environment config; complex if connections switch per row |
| Pre/post session SQL with DDL | DROP INDEX / TRUNCATE TABLE / ANALYZE TABLE before/after load | Migrate as notebook cells preceding/following the main load logic in a Workflow task |
| Sorter on very large datasets | Informatica Sorter uses unlimited disk temp space; Spark shuffle has different constraints | Right-size cluster shuffle storage; push ORDER BY to the consuming query where possible |
| MDM Hub integration | MDM is a separate product with its own data model and survivorship rules | Out of scope for lakehouse migration — must be addressed as a separate MDM track |
| Worklet reuse across many workflows | Worklets are shared objects; changes affect all consumers | Translate to Unity Catalog functions or shared DLT pipeline modules with explicit versioning |