Challenges in Relational Multi-Table Synthetic Data Generation

akshatgupta6
6 hours ago
5 min read

1. Introduction

Synthetic data generation is increasingly important when working with sensitive or regulated datasets. While generating synthetic data for single tables is straightforward using GANs or statistical models, generating relational multi-table synthetic data is significantly more complex.

Relational databases do not exist in isolation. They contain relationships that define how information flows across the system:

Foreign keys (parent → child)
Many-to-one (Fees → Student)
One-to-many (Teacher → Courses)
Many-to-many join tables (Student ↔ Courses)
Deep multi-level dependencies (Student → StudentCourses → Marks → Rank)

Every table influences others, and these dependencies must be preserved in the synthetic version. This combination of relational structure and statistical realism makes multi-table synthetic generation one of the toughest challenges in modern data science.

2. Problems With Relational Multi-Table Synthetic Data

Relational synthetic data must simultaneously maintain two distinct but equally critical properties.

A. Structural Correctness

Every foreign key must point to a valid parent record:

No orphan rows
No mismatched or invalid IDs
Correct table row counts and link consistency

B. Statistical Realism

Beyond structural correctness, the synthetic dataset must behave like the real dataset:

Distributions of numeric values (mean, variance, skewness)
Categorical patterns and frequencies
Joint relationships between columns
Cross-table correlations
Behavioral patterns (e.g., students with more courses tend to pay more fees)
Preserving cardinality patterns (fees per student, courses per teacher)
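One way to make "cross-table correlation" measurable is to aggregate each child table per student and correlate the aggregates. The sketch below is a minimal illustration, not a method used in this article: the helper names and the toy rows are made up, and it assumes tables are available as lists of dictionaries keyed by column name.

```python
from collections import Counter, defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # undefined if either side is constant


def courses_vs_fees(student_courses, fees):
    """Correlate courses-per-student with total amount_due per student."""
    course_counts = Counter(row["student_id"] for row in student_courses)
    fee_totals = defaultdict(float)
    for row in fees:
        fee_totals[row["student_id"]] += row["amount_due"]
    # Only students present in both tables contribute a data point.
    shared = sorted(set(course_counts) & set(fee_totals))
    return pearson([course_counts[s] for s in shared],
                   [fee_totals[s] for s in shared])


# Toy rows (made up): more courses go with higher total fees.
student_courses = [{"student_id": s} for s in ["s1", "s2", "s2", "s3", "s3", "s3"]]
fees = [
    {"student_id": "s1", "amount_due": 100.0},
    {"student_id": "s2", "amount_due": 200.0},
    {"student_id": "s3", "amount_due": 150.0},
    {"student_id": "s3", "amount_due": 150.0},
]

r = courses_vs_fees(student_courses, fees)
print(f"courses-per-student vs total fees: r = {r:.3f}")
```

Running the same function on the real and the synthetic tables and comparing the two coefficients gives a rough signal of whether a behavioral pattern survived generation.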

Most simple generators fail because they satisfy either structure or realism, but not both. GANs excel at statistical realism but know nothing about foreign key rules. Statistical models excel at structural constraints but miss deep correlations.

Relational synthetic data requires both sides to work together.

3. Our Use Case: A School Management Database

To illustrate the complexity of relational multi-table synthetic data generation, let’s look at a real example: a school management system. This system tracks students, teachers, courses, financial activity, academic performance, and various administrative operations. Before attempting synthetic generation, it’s important to understand the structure of the data and how the tables depend on each other.

Below is a high-level overview of the main tables and their columns.

Students

student_id; enrollment_number; first_name; last_name; date_of_birth; gender; address_line1; address_line2; city; state; postal_code; phone; email; guardian_name; guardian_contact; enrollment_date; status; homeroom_teacher_id

Teachers

teacher_id; employee_code; first_name; last_name; qualification; experience_years; department; phone; email; hire_date; status

Courses

course_id; course_code; name; description; credits; level; department; is_active; lead_teacher_id

Student Courses

student_course_id; student_id; course_id; academic_year; term; enrollment_status; grade_letter; grade_points

Marks

mark_id; student_course_id; assessed_by_teacher_id; assessment_type; max_score; score_obtained; weightage_percent; assessment_date

Fees

fee_id; student_id; fee_type; amount_due; amount_paid; due_date; payment_date; payment_mode; status; approved_by_teacher_id

Games Participation

games_participation_id; student_id; game_name; team_name; level; position_played; achievement; season_year; coach_teacher_id; manager_teacher_id

Canteen Transactions

transaction_id; student_id; transaction_date; item_description; quantity; amount; payment_method; authorized_by_teacher_id

Transport Assignments

transport_id; student_id; route_name; pickup_point; dropoff_point; vehicle_number; driver_name; driver_contact; valid_from; valid_to; monthly_fee; route_incharge_teacher_id

Course Instructors

course_instructor_id; course_id; teacher_id; academic_year; term; role

This schema alone shows why relational synthetic data is challenging:

Multiple one-to-many relationships
Several many-to-many join tables
Deep dependency chains (e.g., students → student_courses → marks)
Multiple foreign keys pointing to teachers
Highly diverse data types (UUIDs, dates, numbers, categorical values)
Behavioral/transactional tables (canteen, transport, fees)

This is the dataset we use to evaluate relational synthetic data techniques, and it clearly goes beyond what simple or single-table generative models can handle.

4. First Approach Tested: SDV Multitable – HMA Synthesizer

SDV offers HMASynthesizer for multi-table generation.

HMA (Hierarchical Modeling Algorithm) is a statistical, non-GAN method.

Why we tested HMA

It supports relational structures
Ensures referential integrity automatically
Easy to implement for scenarios with one parent table and one child table

We tested it on the simplest pair:

Tables tested

students (parent)
fees (child, references student_id)

This is the minimum relational test case.

5. SDV HMA Experimental Output & Results

A. Synthetic table generation

HMA successfully produced synthetic versions of:

synthetic_students
synthetic_fees

B. Referential integrity check

HMA automatically enforces FK relationships.

Result:

FK Violations: 0

This means:

Every row in fees.student_id correctly referenced an ID in students.student_id.
Structural integrity was preserved.
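A check of this kind can be sketched in a few lines. The version below is an illustrative stand-in, not the exact script used for the result above: the find_orphans helper and the toy rows are hypothetical, and the third fee row is deliberately orphaned to show what a violation looks like.

```python
def find_orphans(child_rows, parent_rows, fk_column, pk_column):
    """Return child rows whose foreign key has no matching parent primary key."""
    parent_ids = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_ids]


# Toy stand-ins for synthetic_students and synthetic_fees (values are made up).
students = [{"student_id": "s1"}, {"student_id": "s2"}]
fees = [
    {"fee_id": "f1", "student_id": "s1"},
    {"fee_id": "f2", "student_id": "s2"},
    {"fee_id": "f3", "student_id": "s9"},  # deliberately orphaned for the demo
]

orphans = find_orphans(fees, students, "student_id", "student_id")
print(f"FK Violations: {len(orphans)}")
```

On a dataset with intact referential integrity, the returned list is empty and the count is 0.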

C. Cardinality Distribution Comparison

We compared the distribution of fees per student in the real and synthetic data.

Fees-per-student stats (real vs. synthetic)

Statistic	Real	Synthetic
count	991	988
mean	5.045	5.060
std	2.164	2.209
min	1	1
25%	3	4
50%	5	5
75%	6	7
max	15	11

Interpretation

The synthetic distribution closely matches the real one in mean and standard deviation
The 25% and 75% quartiles each shift up by one (3→4, 6→7), which is a mild change
Maximum child count is reduced (15→11), a common issue in statistical models
Overall, the HMA output shows good statistical alignment for this simple scenario
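The comparison above boils down to counting child rows per parent and summarizing those counts. The following is a minimal sketch using only the Python standard library; the function names and the toy foreign key values are assumptions for illustration, and the quantile method is one reasonable choice rather than a confirmed match for the tool that produced the reported figures.

```python
from collections import Counter
from statistics import mean, quantiles, stdev


def children_per_parent(child_rows, fk_column):
    """Number of child rows per parent ID (parents without children are not counted)."""
    return list(Counter(row[fk_column] for row in child_rows).values())


def summarize(counts):
    """Summary statistics in the same shape as the figures reported above."""
    q1, q2, q3 = quantiles(counts, n=4, method="inclusive")
    return {
        "count": len(counts), "mean": mean(counts), "std": stdev(counts),
        "min": min(counts), "25%": q1, "50%": q2, "75%": q3, "max": max(counts),
    }


# Toy fee rows (made up): s1 has 3 fees, s2 has 1, s3 has 2.
fees = [{"student_id": s} for s in ["s1", "s1", "s1", "s2", "s3", "s3"]]

summary = summarize(children_per_parent(fees, "student_id"))
for name, value in summary.items():
    print(f"{name:>5} {value}")
```

Calling summarize once on the real fees table and once on the synthetic one yields two dictionaries that can be compared statistic by statistic.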

6. Relational Score Using Non-GAN Approach (HMA)

Metric	Result
Foreign Key Integrity	100% (0 violations)
Cardinality Preservation	~94% similarity
Distribution Similarity (mean/std)	High match
Relational Realism Score	High for 1→N relations

Based on both the FK checks and the cardinality alignment, HMA successfully handled simple hierarchical relationships.

7. Limitations of HMA for Our Full Schema

While HMA worked for two tables, it fundamentally cannot scale to the full complexity of our school management database.

Below are the key limitations.

1. No Support for Many-to-Many Tables

HMA requires a tree-shaped relational structure.

Tables like:

student_courses (student_id, course_id)
course_instructors (course_id, teacher_id)

represent graphs, not trees.

HMA cannot model a child table with two parents.
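To see how much of the schema falls outside a tree shape, it helps to list the parents of each table. The sketch below assumes the foreign key targets that the *_id column names in the schema overview suggest; the article lists columns rather than constraints, so the PARENTS mapping is an inference, and the helper name is hypothetical.

```python
from graphlib import TopologicalSorter

# Parent tables per table, inferred from the *_id columns (an assumption).
PARENTS = {
    "teachers": set(),
    "students": {"teachers"},
    "courses": {"teachers"},
    "student_courses": {"students", "courses"},
    "marks": {"student_courses", "teachers"},
    "fees": {"students", "teachers"},
    "games_participation": {"students", "teachers"},
    "canteen_transactions": {"students", "teachers"},
    "transport_assignments": {"students", "teachers"},
    "course_instructors": {"courses", "teachers"},
}


def multi_parent_tables(parents):
    """Tables that reference more than one distinct parent table."""
    return sorted(table for table, ps in parents.items() if len(ps) > 1)


# A valid generation order: every parent appears before its children.
order = list(TopologicalSorter(PARENTS).static_order())

print("multi-parent tables:", multi_parent_tables(PARENTS))
print("generation order:", order)
```

Under these assumed relationships, seven of the ten tables have two distinct parents, which is why a strictly tree-shaped model covers so little of this database.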

2. No Support for GAN Training

HMA is purely statistical.

This means:

No ability to learn high-dimensional correlations
Poor performance on complex interactions

3. Cannot Learn Multi-Table Patterns

Relationships like:

“Students with tougher courses tend to score lower marks”
“Students in certain batches pay fees differently”
“Teachers influencing student performance across multiple tables”

cannot be learned by statistical hierarchies.

4. Fails on Synthetic Transactional or Behavioral Data

Tables like:

canteen_transactions
events_participation
attendance_logs

contain high-frequency behavioral data.

These require GAN-based sequence modeling or temporal modeling, which HMA simply cannot handle.

5. Struggles With UUID-Based Identifiers

Our database uses UUIDs for:

student_id
fee_id
course_id
etc.

UUIDs have extremely high cardinality, and statistical models cannot learn their structure.

This results in:

Reused IDs
Incorrect string formats
Potential FK mismatches

We had to manually enforce regex-based UUID generation to fix this.
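The same effect can also be approximated as a post-processing step: give every parent row a fresh UUID, remap the child foreign keys through the same mapping, and validate the format with a regex. The sketch below shows that alternative and is not necessarily the fix applied here; the reassign_uuids helper and the toy rows are hypothetical.

```python
import re
import uuid

# Canonical lowercase UUID version 4 format.
UUID4_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)


def reassign_uuids(parent_rows, child_rows, pk_column, fk_column):
    """Give every parent a fresh UUID4 and rewrite child foreign keys to match."""
    id_map = {}
    new_parents = []
    for row in parent_rows:
        old_id = row[pk_column]
        if old_id in id_map:
            raise ValueError(f"duplicate parent key: {old_id!r}")
        id_map[old_id] = str(uuid.uuid4())
        new_parents.append({**row, pk_column: id_map[old_id]})
    # A KeyError here means a child row referenced a parent that does not exist.
    new_children = [{**row, fk_column: id_map[row[fk_column]]} for row in child_rows]
    return new_parents, new_children


# Toy rows with malformed IDs (made up).
students = [{"student_id": "abc"}, {"student_id": "xyz"}]
fees = [{"fee_id": "f1", "student_id": "abc"}, {"fee_id": "f2", "student_id": "xyz"},
        {"fee_id": "f3", "student_id": "abc"}]

new_students, new_fees = reassign_uuids(students, fees, "student_id", "student_id")
print(all(UUID4_PATTERN.match(s["student_id"]) for s in new_students))
```

Because the mapping is built from the parent table and applied to the child table, the parent-child links are preserved while reused or malformed identifiers are replaced.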

6. Cannot Handle Deep Graph-Shaped Schemas

Our dependency chains are not simple:

students → student_courses → marks

students → fees

courses → course_instructors → teachers

students → events_participation → event_details

HMA cannot:

Propagate relationships through multiple levels
Learn cross-table correlations
Handle graph-centric relational patterns

It is limited to shallow, tree-like schemas only.

7. Cardinality Drift

HMA tends to:

Under-estimate maximum values
Smooth out spikes
Lose long-tail behavior

This leads to synthetic datasets that look “average” but lose realistic extremes.
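Mean and standard deviation hide this kind of drift, so it is worth checking the tail directly. The sketch below measures what share of parents exceed a given child count; the tail_share helper and both toy count lists are made up for illustration and only loosely echo the max values reported earlier (15 real, 11 synthetic).

```python
def tail_share(counts, threshold):
    """Fraction of parents with more than `threshold` children."""
    return sum(1 for c in counts if c > threshold) / len(counts)


# Toy children-per-parent counts (made up, not the article's data).
real_counts = [1, 2, 3, 5, 5, 6, 15]
synthetic_counts = [1, 3, 4, 5, 5, 7, 11]

threshold = max(synthetic_counts)
print(f"real share above {threshold}: {tail_share(real_counts, threshold):.3f}")
print(f"synthetic share above {threshold}: {tail_share(synthetic_counts, threshold):.3f}")
print(f"max ratio (synthetic / real): {max(synthetic_counts) / max(real_counts):.2f}")
```

A nonzero real share next to a zero synthetic share at the same threshold is a direct sign that the long tail has been cut off.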

8. Conclusion

Our initial experiments show:

What HMA can do well

Works for simple 1→N relationships
Perfect foreign key integrity
Good basic distribution alignment
Fast and simple to use

Where HMA fails

Many-to-many tables
Multi-parent relationships
Deep dependency structures
UUID-heavy schemas
High-dimensional correlations
Behavioral or transactional datasets
Any graph-shaped schema

Given all these limitations, HMA cannot be used for our full school-management database.

To generate realistic, structurally correct synthetic data for the entire relational system, a more advanced approach is required:

A multi-table GAN-based pipeline that models each table individually, conditions child tables on parent embeddings, and reconstructs relational integrity after generation.

This approach enables:

High realism
Support for many-to-many tables
Deep relational consistency
True cross-table correlation learning
Correct UUID formatting
Full graph-level reconstruction

This method is significantly more powerful than HMA and is suitable for real-world relational databases like ours.
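As one possible reading of "reconstructs relational integrity after generation", the sketch below assigns a valid parent ID to each independently generated child row, drawing each parent's child count from the cardinalities observed in the real data. The article does not describe this step in detail, so the assign_parents function, its parameters, and the toy inputs are all assumptions.

```python
import random


def assign_parents(parent_ids, n_children, real_child_counts, seed=0):
    """Return one valid parent ID per child, roughly following real cardinalities."""
    rng = random.Random(seed)
    pool = []
    for parent_id in parent_ids:
        # Draw this parent's target child count from the observed real counts.
        pool.extend([parent_id] * rng.choice(real_child_counts))
    rng.shuffle(pool)
    if len(pool) < n_children:
        # Not enough slots: top up with uniformly chosen parents.
        pool.extend(rng.choices(parent_ids, k=n_children - len(pool)))
    # Too many slots: truncating after the shuffle drops slots at random.
    return pool[:n_children]


# Toy inputs (made up): three synthetic parents, ten synthetic child rows.
parent_ids = ["p1", "p2", "p3"]
foreign_keys = assign_parents(parent_ids, 10, real_child_counts=[1, 3, 5, 6])
print(foreign_keys)
```

Every returned key exists in the parent table by construction, so the result has no orphan rows. The truncation and top-up steps distort the target cardinalities slightly, which a production pipeline would need to handle more carefully.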