Informatica BDM — SA Migration Guide¶
Purpose: Give a Solution Architect enough depth to assess an Informatica Big Data Management (BDM) estate, understand its moving parts, and map a migration path to Databricks.
This is not a developer guide. You won't be building BDM mappings. You will be walking customer sites, reviewing architecture diagrams, asking the right questions, and scoping what it takes to move to a modern lakehouse platform.
Architecture Diagrams¶
Informatica BDM Platform Architecture¶
How BDM's product suite fits together — from developer tooling through runtime execution to cluster resources.
```mermaid
flowchart TD
subgraph DEV["Developer Layer"]
DEV_TOOL["Developer Tool\n(Eclipse-based IDE)\nBuild mappings, profiles, rules visually"]
ADMIN_CONSOLE["Administrator Console\nManage services, connections, permissions"]
end
subgraph REPO["Repository Layer"]
MRS["Model Repository Service (MRS)\nCentral metadata store — mappings,\nworkflows, connections, data objects"]
VERSION["Version Control\nOptional Git / SVN integration"]
MRS <--> VERSION
end
DEV_TOOL -- check in / check out --> MRS
subgraph RUNTIME["Runtime Layer"]
DIS["Data Integration Service (DIS)\nOrchestrates mapping execution;\nchooses engine per mapping config"]
BLAZE["Blaze Engine\nInformatica native Hadoop engine\n(legacy; pushes jobs to YARN)"]
SPARK["Spark Engine\nNative Spark execution on YARN/K8s"]
HIVE["Hive Engine\nPushes logic as HiveQL"]
DIS --> BLAZE
DIS --> SPARK
DIS --> HIVE
end
MRS -- deploys mapping configs --> DIS
subgraph CLUSTER["Big Data Cluster"]
YARN["YARN / Kubernetes\nCluster resource manager"]
HDFS["HDFS / S3 / ADLS\nDistributed storage"]
BLAZE --> YARN
SPARK --> YARN
YARN --> HDFS
end
subgraph ORCH["Orchestration Layer"]
WF_SVC["Workflow Service\nChains mapping tasks, decisions, loops"]
MONITOR["Monitoring / Administrator\nJob status, logs, reruns"]
EXTERNAL["External Scheduler\nControl-M / Autosys / UC4"]
EXTERNAL -- triggers workflow --> WF_SVC
WF_SVC --> MONITOR
end
WF_SVC -- submits mapping tasks --> DIS
subgraph GOV["Governance & Quality"]
IDQ["Informatica Data Quality (IDQ)\nProfiling, rules, scorecards"]
MM["Metadata Manager\nLineage, impact analysis"]
AXON["Axon Data Governance\nBusiness glossary, data stewardship"]
end
DIS -- profiling samples --> IDQ
MRS -- metadata feed --> MM
```
Informatica BDM as ETL — Data Flow Between Systems¶
How BDM sits between source systems and targets in a typical enterprise big data pipeline.
```mermaid
flowchart LR
subgraph SOURCES["Source Systems"]
RDBMS[("Relational DB\nOracle / SQL Server / Teradata")]
HDFS_SRC["HDFS / Cloud Storage\nS3 / ADLS / GCS"]
MF["Mainframe / Flat Files\nFixed-width / Delimited"]
KAFKA["Streaming\nKafka topics"]
end
subgraph BDM["Informatica BDM ETL Layer"]
direction TB
READERS["Source Qualifiers & Readers\nRead from RDBMS, HDFS, files, Kafka"]
TRANSFORMS["Transformation Logic\nExpression · Joiner · Aggregator\nFilter · Lookup · Router · Sorter\nNormalizer · Rank · Union"]
SCRATCH["Temporary / Intermediate\nHDFS staging files or Hive tables\nbetween mapping stages"]
WRITERS["Target Writers\nWrite to RDBMS, Hive, HDFS,\nHBase, cloud storage"]
READERS --> TRANSFORMS
TRANSFORMS -- intermediate staging --> SCRATCH
SCRATCH --> TRANSFORMS
TRANSFORMS --> WRITERS
end
subgraph TARGETS["Target Systems"]
HIVE_TGT[("Hive / Impala\nAnalytic SQL layer")]
DW[("Data Warehouse\nTeradata / Redshift / Synapse")]
DATALAKE["Data Lake\nHDFS / S3 / ADLS (Parquet / ORC)"]
RDBMS_TGT[("Operational DB\nOracle / SQL Server")]
end
subgraph ORCH["Orchestration"]
WF["BDM Workflow\nChains mapping tasks\nacross the pipeline"]
end
SOURCES --> READERS
WRITERS --> TARGETS
WF -. schedules .-> BDM
```
Sections¶
- Ecosystem Overview
- Mappings — The Core Building Block
- Data Formats and Schema
- Parallelism and Scaling Model
- Project Structure and Version Control
- Orchestration: Workflows and External Schedulers
- Metadata, Lineage, and Impact Analysis
- Data Quality with IDQ
- Informatica BDM File Formats Reference
- Migration Assessment and Artifact Inventory
- Migration Mapping to Databricks
1. Ecosystem Overview¶
What Is Informatica BDM?¶
Informatica Big Data Management (BDM) is Informatica's enterprise ETL platform built specifically for big data environments — Hadoop, Hive, cloud object storage, and Spark. It extends the PowerCenter lineage into the distributed computing era, allowing customers to build PowerCenter-style mappings that execute natively on a Spark or Hive engine rather than on an Integration Service appliance.
Where PowerCenter runs transformations on a dedicated Integration Service node, BDM pushes the execution to the cluster — the transformation logic compiles to Spark jobs or HiveQL and runs where the data lives. This is the fundamental architectural difference that makes BDM relevant to customers who have already moved data to HDFS, S3, or ADLS.
BDM is part of Informatica's broader IDMC (Intelligent Data Management Cloud) story, but most legacy BDM estates are on-premises or on cloud VMs, not in the managed cloud service.
The Informatica BDM Product Context¶
| Product | What It Does | Migration Relevance |
|---|---|---|
| Informatica BDM | ETL/ELT for Hadoop, Hive, Spark, and cloud storage | High — the primary migration target |
| PowerCenter | Traditional ETL on dedicated Integration Service nodes | Often runs alongside BDM; separate migration track |
| IDQ (Data Quality) | Profiling, rule-based validation, scorecards | Medium — quality rules must be replicated |
| Metadata Manager | Lineage, impact analysis, catalog integration | High — use for estate inventory |
| Axon Data Governance | Business glossary, data stewardship | Low — governance layer, not pipeline logic |
| IDMC / CDI | Cloud-managed SaaS successor to BDM/PowerCenter | Relevant if customer is already migrating to IDMC |
| Enterprise Data Catalog (EDC) | Auto-scanning catalog with ML-assisted lineage | High — often the richest source of inventory data |
SA Tip: Many enterprise customers run both PowerCenter and BDM — PowerCenter handles relational ETL (RDBMS to RDBMS), BDM handles big data ETL (HDFS, Hive, S3). When scoping a migration, ask which product owns which pipelines. The migration paths are different and should be tracked separately.
Why Customers Want to Migrate¶
| Driver | What It Means for the Engagement |
|---|---|
| Hadoop cluster decommission | Many customers are shutting down on-prem Hadoop; BDM loses its execution target |
| License cost | BDM licensing is expensive — Spark-native pipelines remove the vendor dependency |
| Talent | Informatica developers are expensive; Python/SQL skills are far more available |
| Cloud-first mandate | BDM requires managing cluster connectivity and service infrastructure; Databricks is managed |
| IDMC transition pressure | Informatica's own roadmap points to IDMC — customers face forced migration anyway |
| Performance | Databricks' Photon engine often outperforms BDM-generated Spark jobs, which carry translation and submission overhead |
Key Discovery Questions¶
Before scoping a migration, ask:
- How many BDM mappings are in active production use? (Ask for last-run date — many estates have 30–50% unused mappings)
- Which execution engine is in use — Blaze, Spark, or Hive? (Blaze is legacy; Spark is the current path)
- What are the source and target systems? (Hive, HDFS, S3, ADLS, RDBMS, HBase)
- Are mappings run standalone or via BDM Workflows? Are external schedulers involved?
- Does the customer use IDQ rules embedded in mappings, or is IDQ a separate profiling-only tool?
- Is Metadata Manager or EDC in use, and is its metadata current?
- Are there reusable mapplets (shared transformation logic) — and how many mappings depend on them?
- What does the Hadoop cluster look like — is it being decommissioned alongside the migration?
- Is there any IDMC / CDI migration already underway that would affect scope?
2. Mappings — The Core Building Block¶
The Mapping¶
In Informatica BDM, the mapping is the fundamental unit of work — equivalent to an Ab Initio graph or a Talend Job. A mapping is a visual dataflow diagram: data enters through source objects, passes through a series of transformation objects, and exits to target objects.
BDM mappings look almost identical to PowerCenter mappings and are built in the same Developer Tool IDE. The key difference is that a BDM mapping carries an execution engine setting that tells the Data Integration Service whether to run it on Spark, Hive, or Blaze — rather than on the Integration Service's own process.
A single BDM estate can have hundreds to thousands of mappings across multiple projects, with wide variation in complexity, recency, and active use.
Transformations¶
A transformation is a single processing step inside a mapping. BDM ships with a large library of transformations, and the set of available transformations varies by execution engine (not all transformations are supported on all engines).
| Category | Transformations | What They Do |
|---|---|---|
| Input | Source Qualifier, Read | Read from RDBMS, Hive, HDFS, files, Kafka |
| Transform | Expression, Normalizer, Rank | Field-level calculations, flattening arrays, top-N |
| Filter & Route | Filter, Router | Conditional row filtering and multi-output branching |
| Join & Lookup | Joiner, Lookup | Dataset joins, enrichment from static or dynamic sources |
| Aggregate | Aggregator, Sorter | Group-by aggregations, deterministic sorting |
| Set Operations | Union | Combining multiple inputs into one output stream |
| Output | Write, Target | Write to RDBMS, Hive, HDFS, cloud storage, HBase |
| Quality | Data Masking, Address Validator | PII masking, IDQ-integrated validation |
SA Tip: Ask whether the customer uses active/passive lookups extensively. Passive lookups cache reference data in memory — on large datasets this can cause memory pressure on the cluster. When mapping to Databricks, these become broadcast joins or Delta table lookups, and the sizing implications are different.
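As an illustration, a minimal PySpark sketch of the broadcast-join pattern that typically replaces a cached lookup; the table names and join key (`finance.transactions`, `finance.ref_customer`, `customer_id`) are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for a mapping's main flow and its lookup source
transactions = spark.table("finance.transactions")   # large fact input
ref_customer = spark.table("finance.ref_customer")   # small reference dataset

# BDM cached lookup -> Spark broadcast join: the reference set is shipped to every
# executor, avoiding a shuffle of the large side. Only safe while the reference
# set stays small; very large reference data is better handled as a normal Delta join.
enriched = transactions.join(F.broadcast(ref_customer), on="customer_id", how="left")
```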
Mapplets¶
A mapplet is a reusable sub-mapping — a named, encapsulated transformation logic block that can be embedded inside multiple parent mappings. Mapplets are the BDM equivalent of Ab Initio's wrapped graphs or Talend's subjobs.
When inventorying an estate, mapplets are a critical dependency artifact — a mapplet used by 40 mappings must be migrated before any of those mappings. Identify all mapplets and their fan-out early.
Mapping Tasks vs. Mappings¶
A mapping defines the transformation logic. A mapping task (also called a session in PowerCenter terminology) is a runtime configuration that links a mapping to specific connections, parameter values, and execution settings. One mapping can have multiple mapping tasks — each pointing to different environments or connection sets.
| Concept | Description | Databricks Equivalent |
|---|---|---|
| Mapping | The transformation logic definition | Databricks notebook / DLT pipeline |
| Mapping Task | Runtime config — connections, parameters, engine | Job configuration / task parameters |
| Workflow | Ordered collection of tasks with dependencies | Databricks Workflow |
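To make the mapping vs. mapping task distinction concrete, here is an illustrative Jobs API 2.1-style task list in which one notebook (the "mapping") is reused by two tasks that differ only in their parameters. The notebook path and parameter names are hypothetical:

```python
# Two "mapping tasks" over the same "mapping" (notebook), differing only in runtime parameters.
# This mirrors BDM's split between a mapping and its environment-specific mapping tasks.
job_tasks = [
    {
        "task_key": "load_transactions_dev",
        "notebook_task": {
            "notebook_path": "/Pipelines/load_transactions",
            "base_parameters": {"source_schema": "finance_dev", "target_table": "transactions_delta_dev"},
        },
    },
    {
        "task_key": "load_transactions_prod",
        "notebook_task": {
            "notebook_path": "/Pipelines/load_transactions",
            "base_parameters": {"source_schema": "finance_prod", "target_table": "transactions_delta"},
        },
    },
]
```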
3. Data Formats and Schema¶
How BDM Describes Data¶
BDM uses data objects to represent the schema of sources and targets. Rather than standalone schema definition files (like Ab Initio's DML), BDM embeds schema definitions inside the mapping object itself or links them to reusable physical data objects stored in the repository.
| Schema Artifact | Description | Migration Note |
|---|---|---|
| Physical Data Object | Reusable source/target schema definition registered in the repository | Maps to a Unity Catalog table or Delta schema definition |
| Source Qualifier | Read-side transformation that embeds a SQL override query | Inline query logic must be extracted and replicated in Databricks |
| Flat File Data Object | Schema for delimited or fixed-width flat files | Maps to a Databricks Auto Loader schema or COPY INTO definition |
| Hive data object | Schema for Hive tables with partition column definitions | Maps directly to a Delta/Hive metastore table |
| Complex File Data Object | Schema for Avro, Parquet, ORC, JSON | Directly supported in Databricks; minimal migration effort |
Data Types¶
BDM uses Informatica's native type system, which maps to underlying platform types at runtime. Type mapping complexity varies by execution engine.
| Informatica Type | Description | Databricks / Spark Type |
|---|---|---|
| `String` | Variable-length string (up to 104,857,600 bytes) | StringType |
| `Integer` | 32-bit signed integer | IntegerType |
| `Bigint` | 64-bit signed integer | LongType |
| `Double` | 64-bit IEEE floating point | DoubleType |
| `Decimal(p,s)` | Arbitrary-precision decimal | DecimalType(p,s) |
| `Date/Time` | Date and timestamp with format | TimestampType / DateType |
| `Binary` | Raw bytes | BinaryType |
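As a sketch of how these mappings land in code, a hypothetical Spark schema for one migrated table using the equivalents above:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, LongType,
    DoubleType, DecimalType, TimestampType, BinaryType,
)

# Hypothetical target schema for a migrated mapping; each field uses the Spark type
# that the corresponding Informatica type maps to in the table above
transactions_schema = StructType([
    StructField("txn_id",      LongType(),         nullable=False),  # Bigint
    StructField("account_no",  StringType(),       nullable=False),  # String
    StructField("branch_code", IntegerType(),      nullable=True),   # Integer
    StructField("amount",      DecimalType(18, 2), nullable=True),   # Decimal(p,s)
    StructField("fx_rate",     DoubleType(),       nullable=True),   # Double
    StructField("posted_at",   TimestampType(),    nullable=True),   # Date/Time
    StructField("raw_payload", BinaryType(),       nullable=True),   # Binary
])
```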
SA Tip: BDM's `Decimal` handling differs slightly between Spark and Hive execution modes — precision truncation rules are not identical. If the customer's pipelines rely on specific decimal rounding behavior, flag this as a testing risk during migration.
Source Qualifier SQL Overrides¶
In PowerCenter and BDM, developers frequently use SQL overrides in Source Qualifier transformations to push filter predicates, join conditions, or custom query logic into the database rather than processing records in the ETL engine. These overrides are hand-written SQL embedded inside the mapping XML.
Migration relevance: SQL overrides are one of the most common sources of hidden logic in BDM estates. They don't appear in the visual dataflow — they're in a property field of the Source Qualifier. During inventory, always extract and review SQL overrides; they are often the most complex part of the mapping to replicate.
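A hedged sketch of the two usual ways to carry an override across: push it back down to the source database over JDBC, or re-express it as Spark SQL once the tables are in the lakehouse. The connection details, table names, and query are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option (a): push the former Source Qualifier override back to the RDBMS via JDBC
sq_override = """
    SELECT t.txn_id, t.amount, c.segment
    FROM transactions t JOIN customers c ON t.customer_id = c.customer_id
    WHERE t.posted_date >= DATE '2024-01-01'
"""
df_pushdown = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/FINANCE")   # placeholder connection
    .option("query", sq_override)
    .option("user", "etl_user")
    .option("password", "***")                                   # use a secret scope in practice
    .load()
)

# Option (b): re-express the same logic as Spark SQL against tables already in the lakehouse
df_lakehouse = spark.sql("""
    SELECT t.txn_id, t.amount, c.segment
    FROM finance.transactions t JOIN finance.customers c USING (customer_id)
    WHERE t.posted_date >= '2024-01-01'
""")
```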
4. Parallelism and Scaling Model¶
How BDM Achieves Performance¶
BDM's parallelism model is fundamentally different from PowerCenter's, and it's the reason BDM exists as a separate product. BDM doesn't manage partitioning itself — it delegates execution to the cluster and lets the underlying engine (Spark or Hive) handle distributed processing.
Execution Engines¶
When a mapping task runs, the Data Integration Service selects an execution engine based on the mapping's engine setting and the available cluster connection.
| Engine | How It Works | Migration Relevance |
|---|---|---|
| Spark | Compiles the mapping to a Spark application, submits to YARN or K8s | Current-generation; most likely what active pipelines use |
| Hive | Translates mapping logic to HiveQL; executed by Hive on MapReduce or Tez | Older; slower; may indicate legacy pipelines that need re-evaluation |
| Blaze | Informatica's proprietary Hadoop engine (pre-Spark era) | Legacy — if seen, this mapping hasn't been updated in years |
| Native (DIS) | Runs on the Integration Service node, not the cluster | Used when source/target is RDBMS only; no cluster dependency |
SA Tip: Ask which engine the customer's production mappings use. Blaze-mode mappings are a migration red flag — they indicate the customer is running old pipelines on deprecated technology, and they often contain patterns (like HDFS-specific tricks) that have no direct Spark equivalent. Budget extra time for Blaze-to-Spark translation.
Partitioning in BDM¶
Unlike PowerCenter's explicit partition points, BDM relies on Spark's automatic data partitioning. The key configuration levers are:
| Setting | Description | Databricks Equivalent |
|---|---|---|
| Cluster configuration | Number of YARN executors, executor memory, executor cores | Databricks cluster node count and instance type |
| spark.sql.shuffle.partitions | Number of partitions after a shuffle (join, aggregation) | Same Spark config, tunable per job |
| Partition hints in Hive targets | Writing output to HDFS partitioned by column values | Delta table partitioning by column |
| Bucketing in Hive sources | Pre-bucketed Hive tables that enable bucket joins | Delta Z-ordering / liquid clustering |
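A small PySpark sketch of the equivalent levers on Databricks (the shuffle-partition setting and a column-partitioned Delta write); table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shuffle partition count, the same Spark setting BDM mappings inherit from the cluster
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Hive-style partitioned target -> Delta table partitioned by the same column
(
    spark.table("finance.transactions_staged")
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("posted_date")
    .saveAsTable("finance.transactions")
)
```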
What Customers Have Likely Configured¶
At a typical BDM customer site, you will find:
- A centrally managed YARN cluster shared across BDM and other Hadoop workloads
- Spark execution for active mappings, with a handful of Blaze or Hive mappings from earlier phases
- Executor sizing set at the cluster level, not per-mapping — individual mapping performance tuning is often minimal
- Intermediate staging in HDFS (temp tables or temp files) between mapping stages
Migration relevance: BDM's performance profile on Databricks won't automatically match what the customer sees on YARN — cluster sizing, Photon optimization, and Delta caching change the picture. Set expectations that performance validation is a migration task, not an assumption.
5. Project Structure and Version Control¶
The Model Repository Service (MRS)¶
The Model Repository Service is BDM's central metadata store — analogous to Ab Initio's EME or PowerCenter's Repository Service. It stores every mapping, mapplet, workflow, physical data object, and connection definition, versioned and organized into projects and folders.
The MRS is the authoritative source for estate inventory. Any migration assessment should start with an MRS export or direct metadata query.
Project Hierarchy¶
MRS
└── Project: Finance_BDM
├── Folder: Ingest
│ ├── Mapping: load_transactions
│ ├── Mapping: load_customers
│ └── Mapplet: cleanse_address
├── Folder: Transform
│ ├── Mapping: calc_daily_balances
│ └── Workflow: nightly_finance_load
└── Folder: Reference_Data
└── Mapping: refresh_lookup_tables
| Concept | What It Is | Equivalent |
|---|---|---|
| Project | Top-level container for all related artifacts | Git repository / Databricks workspace folder |
| Folder | Logical grouping within a project | Subdirectory / Databricks folder |
| Mapping | Transformation logic definition | Databricks notebook / Python script / DLT pipeline |
| Mapplet | Reusable sub-mapping | Databricks shared library / reusable DLT component |
| Workflow | Orchestration definition | Databricks Workflow / Airflow DAG |
| Physical Data Object | Reusable source/target schema | Unity Catalog table definition |
Version Control¶
BDM supports optional integration with Git or SVN for version control, but many on-premises deployments rely solely on the MRS's built-in versioning (check-in/check-out model). In environments without Git integration:
- Artifact history is stored only in the MRS database
- Rollback requires restoring from MRS, not from a git revert
- Code review is done through the Developer Tool UI, not pull requests
SA Tip: If the customer doesn't have Git integrated with BDM, the migration project itself will need to establish Git practices for the first time. Budget time for this cultural shift — it's often underestimated when customers are coming from an MRS-only versioning model.
Deployment and Promotion¶
BDM uses an export/import workflow for promoting artifacts between environments (DEV → QA → PROD). Artifacts are exported as XML files (.inf or .xml format) and imported into the target environment's MRS. This is BDM's equivalent of a deployment pipeline.
6. Orchestration: Workflows and External Schedulers¶
BDM Workflows¶
A workflow is BDM's built-in orchestration unit — an ordered collection of mapping tasks, decision tasks, and assignment tasks that run in sequence or conditionally based on outcomes.
| Concept | Description | Databricks Equivalent |
|---|---|---|
| Workflow | Named orchestration unit containing tasks | Databricks Workflow / Airflow DAG |
| Mapping Task | A step that runs a specific mapping with its runtime config | Databricks Workflow Task (notebook or Python) |
| Decision Task | A conditional branch — routes execution based on output variable | if_else condition in Workflow or Airflow BranchOperator |
| Assignment Task | Sets a workflow variable for use in downstream tasks | Workflow parameter / Airflow XCom |
| Start/End Events | Define trigger and termination of the workflow | Workflow schedule / trigger |
| Link Conditions | Boolean expressions on workflow variable values that control task sequencing | Databricks Workflow task dependency conditions |
Workflow Structure Example¶
A typical BDM workflow for a nightly data load:
Workflow: NIGHTLY_FINANCE_LOAD
Task 1: load_transactions (runs first)
Task 2: load_customers (depends on Task 1 success)
Task 3: calc_daily_balances (depends on Task 2 success)
Task 4: refresh_lookup_tables (depends on Task 1 success — runs in parallel with Task 2)
Decision: if Task 3 fails → send_alert_email
This maps directly to a Databricks Workflow with task dependencies and on_failure notification tasks.
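A sketch of that workflow as a Jobs API 2.1-style payload; notebook paths and the alert address are placeholders, and the decision/alert branch is expressed as a task-level on_failure notification:

```python
nightly_finance_load = {
    "name": "NIGHTLY_FINANCE_LOAD",
    "tasks": [
        {"task_key": "load_transactions",
         "notebook_task": {"notebook_path": "/Pipelines/load_transactions"}},
        {"task_key": "load_customers",
         "depends_on": [{"task_key": "load_transactions"}],
         "notebook_task": {"notebook_path": "/Pipelines/load_customers"}},
        {"task_key": "refresh_lookup_tables",
         "depends_on": [{"task_key": "load_transactions"}],        # runs in parallel with load_customers
         "notebook_task": {"notebook_path": "/Pipelines/refresh_lookup_tables"}},
        {"task_key": "calc_daily_balances",
         "depends_on": [{"task_key": "load_customers"}],
         "notebook_task": {"notebook_path": "/Pipelines/calc_daily_balances"},
         "email_notifications": {"on_failure": ["finance-ops@example.com"]}},  # replaces the alert decision
    ],
}
```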
Worklets¶
A worklet is a reusable sub-workflow — a named, encapsulated set of tasks that can be embedded in multiple parent workflows. Worklets are the orchestration equivalent of mapplets. During migration, identify all worklets and their fan-out before migrating the workflows that depend on them.
Monitoring¶
The Informatica Administrator Console and Workflow Monitor provide job status visibility — running jobs, historical run logs, success/failure states, and manual rerun capability.
| BDM Feature | Databricks Equivalent |
|---|---|
| Workflow Monitor — real-time status | Databricks Workflows UI — run monitoring |
| Historical run logs | Workflow run history |
| Manual task rerun | Task repair run |
| Email notification on failure | Workflow notification policy |
| Session log (per-mapping execution detail) | Spark driver/executor logs in cluster UI |
External Schedulers¶
Many enterprise BDM environments use an external enterprise scheduler to trigger workflows — BMC Control-M, IBM Workload Scheduler, Autosys, or UC4. In this topology:
- The external scheduler handles cross-system timing (e.g., "wait for the upstream file to arrive from the mainframe")
- BDM Workflow handles intra-pipeline dependencies (task ordering within the pipeline)
Migration relevance: If the customer uses an external scheduler, the migration has two layers: replacing BDM Workflows with Databricks Workflows, AND replacing or re-integrating with the external scheduler. This is frequently underestimated. Ask whether the external scheduler will be replaced or retained — retaining it simplifies migration but leaves a dependency.
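When the scheduler is retained, the integration point is usually a single REST call per pipeline. A minimal sketch, assuming a hypothetical job ID and a token held in the scheduler's credential store:

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "dapiXXXXXXXXXXXXXXXX"                                           # keep in the scheduler's vault
JOB_ID = 987654321                                                       # hypothetical job

# Trigger the Databricks Workflow once the scheduler's cross-system conditions are met
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID, "job_parameters": {"load_date": "2024-01-15"}},  # assumes job-level parameters
    timeout=30,
)
resp.raise_for_status()
run_id = resp.json()["run_id"]   # the scheduler can poll /api/2.1/jobs/runs/get?run_id=... for completion
```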
7. Metadata, Lineage, and Impact Analysis¶
What the MRS Tracks¶
The MRS stores not just artifacts but relationships between them — which mappings use which data objects, which mapplets are embedded in which mappings, which workflows chain which tasks. This relationship graph is the foundation for impact analysis.
| Relationship Type | Example |
|---|---|
| Mapping uses Physical Data Object | load_transactions reads transactions_hive_object |
| Mapping embeds Mapplet | load_customers uses cleanse_address mapplet |
| Workflow runs Mapping Task | NIGHTLY_LOAD executes load_transactions_task |
| Mapping Task runs Mapping | load_transactions_task executes load_transactions |
| Mapping writes target | load_transactions writes to hive.finance.transactions |
Metadata Manager and Enterprise Data Catalog¶
BDM's richer lineage tools are Metadata Manager (MM) and the newer Enterprise Data Catalog (EDC):
| Tool | Purpose | Migration Use |
|---|---|---|
| Metadata Manager | Cross-system lineage — connects BDM pipelines to source and target systems | Estate inventory, cross-system dependency mapping |
| Enterprise Data Catalog (EDC) | Auto-scanned catalog with ML-assisted lineage, data domains, and profiling | Best source of current, trusted inventory data |
| Axon | Business glossary and stewardship | Lower migration relevance — governance layer |
SA Tip: EDC is the most powerful inventory tool in the Informatica ecosystem, but it's a separate licensed product. Ask whether the customer has it. If they do, its scan results can significantly accelerate the migration assessment — you get a cross-system lineage picture without manually tracing each mapping.
Impact Analysis¶
Before migrating any shared artifact (mapplet, physical data object, connection), use the MRS or Metadata Manager to identify all dependents. Key impact analysis questions:
- Which mappings use this mapplet?
- Which workflows depend on this mapping task?
- Which mappings write to this Hive table (which downstream processes might break)?
- Which mappings read from this source physical data object?
Common Lineage Gaps¶
| Gap | Why It Happens | SA Implication |
|---|---|---|
| Shell script invocations | BDM can call OS commands; lineage stops at the script boundary | Manually trace what the script does |
| SQL override queries | Embedded SQL in Source Qualifiers bypasses visual lineage | Extract and document all SQL overrides |
| Parameter-driven paths | File paths or table names passed as runtime parameters | Lineage only visible at runtime, not in static analysis |
| Blaze-mode mappings | Some Blaze-specific patterns are not captured in modern EDC scans | Validate lineage manually for Blaze mappings |
8. Data Quality with IDQ¶
What Informatica Data Quality Does¶
IDQ (Informatica Data Quality) is Informatica's standalone data quality product, tightly integrated with BDM. It provides profiling, rule-based validation, matching/deduplication, and scorecard reporting.
In a BDM estate, IDQ appears in two forms:
1. Standalone IDQ mappings — separate pipelines that profile datasets and produce quality reports
2. Embedded IDQ transformations — quality rules embedded directly inside BDM mappings as transformations
The second form is the migration-critical one — embedded rules are business logic that must be replicated.
IDQ Transformations in BDM Mappings¶
| Transformation | What It Does | Databricks Equivalent |
|---|---|---|
| Data Quality transformation | Applies IDQ rule specifications to records; routes bad records to reject port | Delta Live Tables `expect()` / Great Expectations |
| Address Validator | Validates and standardizes postal addresses against reference data | Third-party address validation API or custom UDF |
| Data Masking | Masks or tokenizes PII fields | Databricks column masking / Unity Catalog row filters |
| Match | Probabilistic or deterministic record matching / deduplication | Custom Spark dedup logic or third-party MDM |
| Exception transformation | Routes records that fail quality rules to an exception table | DLT quarantine pattern |
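A minimal sketch of the DLT expectations-plus-quarantine pattern referenced above, assuming hypothetical table names and rules; the real IDQ rule specifications would be translated into the SQL constraint expressions:

```python
import dlt                               # available inside a Delta Live Tables pipeline
from pyspark.sql import functions as F

@dlt.table(name="transactions_clean")
@dlt.expect_or_drop("valid_amount", "amount IS NOT NULL AND amount >= 0")
@dlt.expect_or_drop("valid_account", "account_no RLIKE '^[0-9]{10}$'")
def transactions_clean():
    # Rows failing either expectation are dropped here (the IDQ "good records" path)
    return dlt.read("transactions_raw")

@dlt.table(name="transactions_quarantine")
def transactions_quarantine():
    # Inverse of the rules above: failing rows land in a separate table (the IDQ "reject port")
    src = dlt.read("transactions_raw")
    passes = (F.col("amount").isNotNull() & (F.col("amount") >= 0)
              & F.col("account_no").rlike("^[0-9]{10}$"))
    return src.filter(~passes)
```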
SA Tip: Embedded IDQ transformations are frequently overlooked in migration scoping because they don't look like "ETL" — they're quality operations. When reviewing mappings, flag every IDQ transformation and document the rule it enforces. These are regulatory or business-critical rules the customer will want to verify are preserved post-migration.
Scorecards and Quality Reports¶
IDQ produces scorecards — dashboards that aggregate quality metrics across datasets over time. If the customer actively uses scorecards, they will need a Databricks-native equivalent (DLT quality metrics, custom Delta tables feeding a BI tool) or a third-party data quality product after migration.
9. Informatica BDM File Formats Reference¶
When you walk into a customer's BDM environment, you will encounter a specific set of file types in every export directory and repository. Knowing what each file is and what it means for migration is essential for artifact inventory.
.inf — Informatica Export File (XML)¶
The .inf file is the primary export format for BDM artifacts — mappings, mapplets, workflows, physical data objects, and connections. When a developer or DBA exports objects from the Developer Tool or Administrator Console, they get .inf files. These are XML-formatted and human-readable.
| Property | Detail |
|---|---|
| Created by | Developer Tool export wizard, Administrator Console, infacmd CLI |
| Stored in | File system (deployment packages, migration archives) |
| Contains | Full definition of one or more BDM objects — transformations, port connections, expressions, parameter references |
| Human-readable? | Yes — XML format, verbose but inspectable |
| Migration target | Each .inf mapping is a migration unit — maps to a Databricks notebook, Python script, or DLT pipeline |
Example .inf snippet (mapping header):
```xml
<POWERMART EXPORT_VERSION="186.0911" REPOSITORY_VERSION="186.0911">
<REPOSITORY NAME="BDM_REPO" VERSION="186" CODEPAGE="UTF-8">
<FOLDER NAME="Finance" GROUP="" OWNER="admin">
<MAPPING NAME="load_transactions" ISVALID="YES" VERSIONNUMBER="1">
<TRANSFORMATION TYPE="Source Qualifier" NAME="SQ_transactions">
...
</TRANSFORMATION>
</MAPPING>
</FOLDER>
</REPOSITORY>
</POWERMART>
```
SA Tip: The `.inf` count in active deployment packages is your best proxy for migration scope — these are the artifacts that are actually being promoted to production. A customer may have 2,000 objects in the MRS but only 300 `.inf` files in their current release package.
.xml — Repository Object Export (MRS Backup / Metadata Extract)¶
The .xml export is a broader repository export — often produced by full-repository backups or Metadata Manager extracts. Unlike the .inf format used for individual deployments, .xml exports cover entire projects or folders and are used for DR and migration.
| Property | Detail |
|---|---|
| Created by | infacmd mrs ExportObjects, Administrator Console backup |
| Stored in | Backup directories, migration staging areas |
| Contains | Full project or folder contents including all artifact types and their relationships |
| Human-readable? | Yes — XML, but more verbose than .inf; relationships between objects are encoded |
| Migration target | Use as the source for estate inventory parsing — extract mapping names, transformation types, port connections |
SA Tip: Ask the DBA team for a full MRS export early in the engagement. Even if the customer hasn't recently backed up, most sites have an `.xml` export from a recent DR drill. This is often faster to work with than live MRS queries for counting and categorizing artifacts.
.prm — Parameter File¶
The .prm file defines runtime parameters for BDM mappings and workflows — file paths, database connection names, date ranges, environment-specific settings. Parameters are passed to mapping tasks and workflow sessions at runtime.
| Property | Detail |
|---|---|
| Created by | Manually by developers; sometimes generated by wrapper scripts |
| Stored in | File system — typically a parameters/ directory adjacent to the mapping deployment |
| Contains | Key=value pairs for workflow-level or mapping-level parameters ($PM_TARGET_TABLE, $LOAD_DATE, etc.) |
| Human-readable? | Yes — plain text key=value format |
| Migration target | Maps to Databricks Workflow job parameters, widgets, or Databricks Secrets for credentials |
Example .prm file:
```
[load_transactions.mapping]
$$LOAD_DATE=2024-01-15
$$SOURCE_SCHEMA=finance_prod
$$TARGET_TABLE=transactions_delta
[load_transactions.session]
$PMTargetFileDir=/data/output/transactions/
$PMBadFileDir=/data/rejects/
```
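Inside the migrated notebook, the same values typically arrive as widgets or job parameters, with credentials moved to a secret scope. A sketch reusing the parameter names from the example above (the defaults and secret scope name are placeholders; `dbutils` is the helper object Databricks notebooks expose):

```python
# Values that used to come from the .prm file now arrive as widgets / job parameters
dbutils.widgets.text("LOAD_DATE", "2024-01-15")
dbutils.widgets.text("SOURCE_SCHEMA", "finance_prod")
dbutils.widgets.text("TARGET_TABLE", "transactions_delta")

load_date     = dbutils.widgets.get("LOAD_DATE")
source_schema = dbutils.widgets.get("SOURCE_SCHEMA")
target_table  = dbutils.widgets.get("TARGET_TABLE")

# Credentials and connection strings move to a secret scope rather than a parameter file
jdbc_password = dbutils.secrets.get(scope="finance-prod", key="oracle-etl-password")
```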
SA Tip: Parameter files are the primary mechanism for environment promotion in BDM — the same mapping runs in dev vs. prod because of different `.prm` files. Understanding the parameter set for each mapping tells you how many environment-specific configurations exist and what Databricks environment variables or secrets need to be created to replace them.
.wfl — Workflow Export File¶
The .wfl format is used by some Informatica export tools to package workflow definitions separately from mapping logic. Not all environments produce .wfl files — many workflows are packaged inside .inf exports — but some customer tooling generates them explicitly.
| Property | Detail |
|---|---|
| Created by | Workflow Manager export, some third-party deployment tools |
| Stored in | Deployment packages, migration archives |
| Contains | Workflow task definitions, link conditions, decision logic, assignment variables |
| Human-readable? | Yes — XML format |
| Migration target | Each .wfl maps to a Databricks Workflow definition (JSON) or an Airflow DAG |
.mpp — Mapping Specification (Developer Tool Project File)¶
The .mpp file is the Developer Tool's project metadata file — it records the project structure, folder hierarchy, and object registration within the MRS. It is analogous to a project file in an IDE.
| Property | Detail |
|---|---|
| Created by | Developer Tool when creating or exporting a project |
| Stored in | Project root directory |
| Contains | Project name, folder structure, registered MRS objects, version metadata |
| Human-readable? | Yes — XML |
| Migration target | Not a direct migration artifact — use to understand folder/project structure for Databricks workspace organization |
Parameter Override Files (environment-specific)¶
Beyond named .prm files, BDM environments often have environment-specific override files — flat text files named by environment (e.g., prod_overrides.ini, qa_connections.cfg) that override default connection names, file paths, and schema names.
| Property | Detail |
|---|---|
| Created by | DBAs or deployment engineers manually |
| Stored in | Deployment infrastructure, often outside the MRS |
| Contains | Connection name mappings, schema aliases, path overrides |
| Human-readable? | Yes — plain text |
| Migration target | Maps to Databricks environment-specific job configuration or Databricks Secrets |
SA Tip: Environment override files are frequently undocumented and stored only on production servers — not in the MRS or version control. Ask specifically for these during the discovery phase. They often contain the production connection names and paths that aren't visible from a pure MRS export.
Quick-Reference Summary Table¶
| File Type | Created By | Human-Readable? | Migration Target |
|---|---|---|---|
| `.inf` | Developer Tool export, infacmd | Yes (XML) | Primary migration unit — notebook / DLT pipeline per mapping |
| `.xml` | MRS backup / full export | Yes (XML, verbose) | Estate inventory — parse for artifact counts and relationships |
| `.prm` | Developer / deployment scripts | Yes (key=value) | Databricks job parameters / Secrets |
| `.wfl` | Workflow Manager export | Yes (XML) | Databricks Workflow definition / Airflow DAG |
| `.mpp` | Developer Tool | Yes (XML) | Workspace structure reference only |
| Override files | DBA / deployment teams | Yes (plain text) | Databricks environment config / Secrets |
10. Migration Assessment and Artifact Inventory¶
Building the Inventory¶
Start from the MRS or EDC — not from file system counts. The MRS is the authoritative list of what's been built; the file system may have stale exports mixed in with active artifacts.
Step 1 — Export the MRS object list:
Use infacmd mrs ListAllObjects or the Administrator Console to get a flat list of all mappings, mapplets, workflows, and physical data objects with their project/folder path and last-modified date.
Step 2 — Filter to active production: Cross-reference the MRS object list with Workflow Monitor run history (export from the Workflow Monitor or query the repository database directly). A mapping that has not appeared in a successful workflow run in 12+ months is a candidate for decommissioning.
Step 3 — Categorize mappings by execution engine: For each mapping task in production, note the engine setting (Spark, Hive, Blaze, Native). Blaze and Hive mappings are higher effort; Spark mappings translate most directly to Databricks.
Step 4 — Extract SQL overrides:
Parse the .inf or .xml exports for Source Qualifier transformations with custom SQL. These represent hidden logic not visible in the visual dataflow.
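A rough inventory sketch for this step: scan exported files for Source Qualifier transformations and flag anything that looks like embedded SQL. Element and attribute names vary across export versions, so treat this as a starting point to adapt against a real export:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

export_files = list(Path("exports").rglob("*.xml")) + list(Path("exports").rglob("*.inf"))
for export_file in export_files:
    tree = ET.parse(export_file)
    for mapping in tree.iter("MAPPING"):
        for xform in mapping.iter("TRANSFORMATION"):
            if xform.get("TYPE") != "Source Qualifier":
                continue
            # Flag any attribute value on the transformation subtree that looks like SQL
            for elem in xform.iter():
                if any("SELECT" in (value or "").upper() for value in elem.attrib.values()):
                    print(f"{export_file.name}: {mapping.get('NAME')} / {xform.get('NAME')} has SQL override text")
                    break
```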
Step 5 — Identify shared artifacts: List all mapplets and physical data objects, and for each, count how many mappings reference them. High-fan-out artifacts must be migrated before the mappings that depend on them.
Complexity Scoring¶
Score each mapping on the following dimensions to produce a migration effort estimate:
| Dimension | Low (1) | Medium (2) | High (3) |
|---|---|---|---|
| Transformation count | < 10 transformations | 10–30 | > 30 |
| Execution engine | Spark | Hive | Blaze or Native (non-cluster) |
| SQL overrides | None | 1–2 simple queries | Complex queries, dynamic SQL |
| Embedded IDQ transformations | None | 1–2 rules | Multiple rules or Match/Dedup |
| Mapplet dependencies | None | 1–2 mapplets | 3+ mapplets or nested mapplets |
| Source/target types | HDFS / Hive / S3 | RDBMS | Mainframe, HBase, exotic connectors |
| Parameter complexity | Simple date/path params | Multiple env overrides | Dynamic parameter generation |
Total score guidance:
- 7–10: Low complexity — strong candidate for direct migration
- 11–15: Medium complexity — requires design review before migrating
- 16–21: High complexity — break into sub-tasks; consider redesigning rather than translating
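A toy scoring helper that mirrors the dimensions and guidance bands above; the per-dimension values (each 1, 2, or 3) would come from your inventory spreadsheet:

```python
def complexity_score(mapping: dict) -> tuple[int, str]:
    dimensions = [
        "transformation_count", "execution_engine", "sql_overrides",
        "idq_transformations", "mapplet_dependencies",
        "source_target_types", "parameter_complexity",
    ]
    total = sum(mapping[d] for d in dimensions)   # each dimension scored 1-3
    if total <= 10:
        band = "low"
    elif total <= 15:
        band = "medium"
    else:
        band = "high"
    return total, band

# Example: a Spark-engine mapping with a complex SQL override and two mapplets
print(complexity_score({
    "transformation_count": 2, "execution_engine": 1, "sql_overrides": 3,
    "idq_transformations": 1, "mapplet_dependencies": 2,
    "source_target_types": 1, "parameter_complexity": 2,
}))  # -> (12, 'medium')
```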
Risk Areas Specific to BDM¶
| Risk Area | Why It Matters |
|---|---|
| Blaze-mode mappings | Blaze is deprecated; no direct Spark equivalent for some Blaze-specific behaviors |
| Hive-to-Delta schema evolution | Hive partition schemes don't always translate cleanly to Delta's partition strategy |
| Source Qualifier SQL overrides | Hidden business logic that won't surface in a standard mapping export review |
| Passive lookup caching | Memory-based lookup caching behavior differs between BDM and Spark broadcast joins |
| IDQ Match/Dedup transformations | Probabilistic matching has no out-of-box Databricks equivalent |
| External scheduler dependencies | Cross-system timing logic lives outside BDM and is easy to overlook |
| Parameter file proliferation | Many environments have accumulated dozens of parameter files without documentation |
| Shared physical data objects | A single data object used by many mappings creates a migration sequencing constraint |
11. Migration Mapping to Databricks¶
Building Blocks¶
| BDM Concept | Databricks Equivalent | Notes |
|---|---|---|
| Mapping | Databricks notebook / Python script / DLT pipeline | One BDM mapping → one Databricks task; DLT for streaming or quality-heavy mappings |
| Mapplet | Shared Python module / Databricks utility library / DLT component | Register in Unity Catalog; treat as a shared dependency |
| Physical Data Object | Unity Catalog table definition | Register schemas in Unity Catalog for discoverability |
| Data Integration Service (DIS) | Databricks cluster / DLT runtime | Compute is managed; no service to administer |
| Model Repository Service (MRS) | Databricks workspace + Unity Catalog | Metadata and lineage in Unity Catalog; code in Git |
| Developer Tool (IDE) | VS Code / Databricks IDE / Databricks Repos | Git-native development replaces check-in/check-out |
Transformations / Components¶
| BDM Transformation | Databricks / Spark Equivalent |
|---|---|
| Source Qualifier | spark.read / Auto Loader / COPY INTO + optional SQL filter |
| Expression | withColumn() / SQL SELECT expressions |
| Filter | filter() / WHERE clause |
| Router | Multiple filter() branches / CASE WHEN splits |
| Aggregator | groupBy().agg() / SQL GROUP BY |
| Joiner | join() / SQL JOIN |
| Lookup (passive) | broadcast() join / Delta table lookup |
| Lookup (active) | Streaming stateful join / foreachBatch lookup |
| Sorter | orderBy() / SQL ORDER BY |
| Normalizer | explode() / posexplode() |
| Rank | Window.partitionBy().orderBy() + row_number() |
| Union | union() / SQL UNION ALL |
| Target | spark.write / Delta MERGE INTO / INSERT INTO |
| Data Quality transformation | DLT expect() / Great Expectations / custom UDF |
| Data Masking | Unity Catalog column masks / sha2() / Databricks pseudonymization |
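A few of these equivalents chained together as a PySpark sketch, using hypothetical finance tables:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

txns  = spark.table("finance.transactions_raw")   # Source Qualifier / Read
custs = spark.table("finance.customers")

result = (
    txns
    .filter(F.col("amount") > 0)                                       # Filter
    .join(custs, on="customer_id", how="left")                         # Joiner / Lookup
    .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))      # Expression
    .groupBy("customer_id", "posted_date")                             # Aggregator
    .agg(F.sum("amount_usd").alias("daily_total"))
    .withColumn(                                                       # Rank
        "spend_rank",
        F.row_number().over(Window.partitionBy("posted_date").orderBy(F.desc("daily_total"))),
    )
)

result.write.format("delta").mode("overwrite").saveAsTable("finance.daily_balances")   # Target
```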
Orchestration¶
| BDM Concept | Databricks Equivalent |
|---|---|
| Workflow | Databricks Workflow |
| Mapping Task | Databricks Workflow Task (notebook or Python) |
| Worklet | Reusable Databricks Workflow (invoked from a parent job via a Run Job task or triggered as a sub-workflow) |
| Decision Task | Workflow task if_else condition |
| Assignment Task | Workflow task output parameter passed to downstream tasks |
| Parameter File | Databricks job parameters / widgets / Databricks Secrets |
| Workflow Monitor | Databricks Workflows UI + Lakehouse Monitoring |
| External Scheduler | Retain existing scheduler triggering Databricks Workflows via REST API / CLI |
Governance and Lineage¶
| BDM Concept | Databricks Equivalent |
|---|---|
| MRS (repository) | Unity Catalog + Databricks workspace Git |
| Metadata Manager lineage | Unity Catalog automatic lineage (column-level for notebooks/DLT) |
| Enterprise Data Catalog (EDC) | Unity Catalog + Databricks Data Intelligence Platform catalog |
| Axon (business glossary) | Unity Catalog tags and comments / external governance tool |
| Check-in / check-out version control | Git branches + pull requests |
| Folder / project structure | Databricks workspace folders + Git repo structure |
Data Quality¶
| BDM / IDQ Concept | Databricks Equivalent |
|---|---|
| IDQ rule specifications | DLT expect() / expect_or_drop() / expect_or_fail() |
| Reject port (bad record routing) | DLT quarantine pattern — bad records to separate Delta table |
| IDQ Scorecard | Lakehouse Monitoring quality metrics dashboard |
| Data profiling | Databricks Data Profiling / Lakehouse Monitoring column statistics |
| Match/Dedup | Custom Spark dedup (window functions + similarity scoring) or third-party MDM |
| Address Validator | Third-party address validation API called via UDF |
What Doesn't Map Cleanly¶
| BDM Capability | Why It's Hard | Recommended Approach |
|---|---|---|
| Blaze execution engine | Blaze has proprietary behaviors not in open Spark | Rewrite as native Spark; no automated conversion |
| Passive lookup caching | BDM caches entire lookup tables in Integration Service memory; Spark broadcast join behavior differs at large scale | Test broadcast join performance at production data volumes; consider Delta table lookups for very large reference datasets |
| Probabilistic Match/Dedup (IDQ) | No out-of-box Databricks equivalent | Custom Spark ML similarity, or third-party MDM tool (Reltio, Profisee, Tamr) |
| Address Validator | Relies on licensed Informatica reference data | Replace with third-party API (Google Maps, SmartyStreets, Melissa) or custom logic |
| Source Qualifier SQL pushdown | Logic is embedded in property fields, not visible in visual export | Must extract, document, and explicitly replicate as Databricks SQL or Auto Loader filter |
| Worklet fan-out | A worklet embedded in 50 workflows becomes 50 individual workflow task patterns | Define a shared Databricks Workflow or library pattern that all workflows call |
| External scheduler cross-system timing | Timing logic lives in Control-M or Autosys, not in BDM | Decide whether to migrate scheduler to Databricks Workflows or retain external scheduler and integrate via REST API |
| Parameter file proliferation | Dozens of environment-specific parameter files with no central registry | Consolidate into Databricks Secrets scope per environment during migration |
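For the Match/Dedup row above, a simple deterministic dedup (keep the most recent record per business key) covers the easy cases; probabilistic matching still needs ML similarity scoring or an MDM tool. A sketch with hypothetical table and column names:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

customers = spark.table("finance.customers_raw")

# Keep the latest record per customer_id; earlier duplicates are dropped
latest_per_key = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())

deduped = (
    customers
    .withColumn("rn", F.row_number().over(latest_per_key))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
deduped.write.format("delta").mode("overwrite").saveAsTable("finance.customers_deduped")
```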