Challenges in Relational Multi-Table Synthetic Data Generation
- akshatgupta6
- 6 hours ago
1. Introduction
Synthetic data generation is increasingly important when working with sensitive or regulated datasets. While generating synthetic data for single tables is straightforward using GANs or statistical models, generating relational multi-table synthetic data is significantly more complex.
Relational databases do not exist in isolation. They contain relationships that define how information flows across the system:
Foreign keys (parent → child)
Many-to-one (Fees → Student)
One-to-many (Teacher → Courses)
Many-to-many join tables (Student ↔ Courses)
Deep multi-level dependencies (Student → StudentCourses → Marks → Rank)
Every table influences others, and these dependencies must be preserved in the synthetic version. This combination of relational structure and statistical realism makes multi-table synthetic generation one of the toughest challenges in modern data science.
2. Problems With Relational Multi-Table Synthetic Data
Relational synthetic data must simultaneously maintain two distinct but equally critical properties.
A. Structural Correctness
Every foreign key must point to a valid parent record:
No orphan rows
No mismatched or invalid IDs
Correct table row counts and link consistency
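The "no orphan rows" requirement can be checked mechanically. The following is a minimal sketch, assuming each table is held as a list of dictionaries; the helper name `find_orphans` and the toy rows are hypothetical and not part of the original experiment.

```python
def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]


# Toy rows, invented for illustration only.
students = [{"student_id": "s1"}, {"student_id": "s2"}]
fees = [
    {"fee_id": "f1", "student_id": "s1"},
    {"fee_id": "f2", "student_id": "s2"},
    {"fee_id": "f3", "student_id": "s9"},  # orphan: there is no student "s9"
]

orphans = find_orphans(fees, "student_id", students, "student_id")
print(f"FK violations: {len(orphans)}")
```

A structurally correct synthetic dataset should report zero violations for every foreign key in the schema.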
B. Statistical Realism
Beyond structural correctness, the synthetic dataset must behave like the real dataset:
Distributions of numeric values (mean, variance, skewness)
Categorical patterns and frequencies
Joint relationships between columns
Cross-table correlations
Behavioral patterns (e.g., students with more courses tend to pay more fees)
Preserving cardinality patterns (fees per student, courses per teacher)
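A cross-table pattern such as "students with more courses tend to pay more fees" can be turned into a single number that is computed on both the real and the synthetic data and then compared. The sketch below assumes tables are lists of dictionaries with the `student_id` and `amount_due` columns from the schema; the function names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def courses_vs_fees(student_courses, fees):
    """Correlate courses per student with total fees due per student."""
    course_count = Counter(row["student_id"] for row in student_courses)
    fee_total = defaultdict(float)
    for row in fees:
        fee_total[row["student_id"]] += row["amount_due"]
    # Only students that appear in both tables can be compared.
    ids = sorted(set(course_count) & set(fee_total))
    return pearson([course_count[i] for i in ids], [fee_total[i] for i in ids])


# Toy rows, invented for illustration only.
student_courses = [{"student_id": s} for s in ["s1", "s2", "s2", "s3", "s3", "s3"]]
fees = [
    {"student_id": "s1", "amount_due": 100.0},
    {"student_id": "s2", "amount_due": 250.0},
    {"student_id": "s3", "amount_due": 300.0},
]
r = courses_vs_fees(student_courses, fees)
```

If the real data gives a clearly positive coefficient and the synthetic data gives a value near zero, the generator has lost that cross-table relationship even if every single column looks realistic.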
Most simple generators fail because they satisfy either structure or realism, but not both. GANs excel at statistical realism but know nothing about foreign key rules. Statistical models excel at structural constraints but miss deep correlations.
Relational synthetic data requires both sides to work together.
3. Our Use Case: A School Management Database
To illustrate the complexity of relational multi-table synthetic data generation, let’s look at a real example: a school management system. This system tracks students, teachers, courses, financial activity, academic performance, and various administrative operations. Before attempting synthetic generation, it’s important to understand the structure of the data and how the tables depend on each other.
Below is a high-level overview of the main tables and their columns.
Students
student_id; enrollment_number; first_name; last_name; date_of_birth; gender; address_line1; address_line2; city; state; postal_code; phone; email; guardian_name; guardian_contact; enrollment_date; status; homeroom_teacher_id
Teachers
teacher_id; employee_code; first_name; last_name; qualification; experience_years; department; phone; email; hire_date; status
Courses
course_id; course_code; name; description; credits; level; department; is_active; lead_teacher_id
Student Courses
student_course_id; student_id; course_id; academic_year; term; enrollment_status; grade_letter; grade_points
Marks
mark_id; student_course_id; assessed_by_teacher_id; assessment_type; max_score; score_obtained; weightage_percent; assessment_date
Fees
fee_id; student_id; fee_type; amount_due; amount_paid; due_date; payment_date; payment_mode; status; approved_by_teacher_id
Games Participation
games_participation_id; student_id; game_name; team_name; level; position_played; achievement; season_year; coach_teacher_id; manager_teacher_id
Canteen Transactions
transaction_id; student_id; transaction_date; item_description; quantity; amount; payment_method; authorized_by_teacher_id
Transport Assignments
transport_id; student_id; route_name; pickup_point; dropoff_point; vehicle_number; driver_name; driver_contact; valid_from; valid_to; monthly_fee; route_incharge_teacher_id
Course Instructors
course_instructor_id; course_id; teacher_id; academic_year; term; role
This schema alone shows why relational synthetic data is challenging:
Multiple one-to-many relationships
Several many-to-many join tables
Deep dependency chains (e.g., students → student_courses → marks)
Multiple foreign keys pointing to teachers
Highly diverse data types (UUIDs, dates, numbers, categorical values)
Behavioral/transactional tables (canteen, transport, fees)
This is the dataset we use to evaluate relational synthetic data techniques — and it clearly goes beyond what simple or single-table generative models can handle.
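To make the shape of this schema concrete, the foreign keys can be written down as a plain mapping and inspected. The sketch below is an assumption-laden reading of the column lists above: the parent of each `*_id` column is inferred from its name (for example, every `*_teacher_id` is taken to reference teachers), and the `FOREIGN_KEYS` constant and `multi_parent_tables` helper are hypothetical names.

```python
# child table -> list of (foreign key column, parent table).
# Parent tables are inferred from column names; this is an assumption.
FOREIGN_KEYS = {
    "students": [("homeroom_teacher_id", "teachers")],
    "courses": [("lead_teacher_id", "teachers")],
    "student_courses": [("student_id", "students"), ("course_id", "courses")],
    "marks": [("student_course_id", "student_courses"),
              ("assessed_by_teacher_id", "teachers")],
    "fees": [("student_id", "students"), ("approved_by_teacher_id", "teachers")],
    "games_participation": [("student_id", "students"),
                            ("coach_teacher_id", "teachers"),
                            ("manager_teacher_id", "teachers")],
    "canteen_transactions": [("student_id", "students"),
                             ("authorized_by_teacher_id", "teachers")],
    "transport_assignments": [("student_id", "students"),
                              ("route_incharge_teacher_id", "teachers")],
    "course_instructors": [("course_id", "courses"), ("teacher_id", "teachers")],
}


def multi_parent_tables(foreign_keys):
    """Return {child: sorted distinct parents} for children with 2+ parents."""
    result = {}
    for child, links in foreign_keys.items():
        parents = sorted({parent for _, parent in links})
        if len(parents) > 1:
            result[child] = parents
    return result


multi_parent = multi_parent_tables(FOREIGN_KEYS)
for child, parents in multi_parent.items():
    print(f"{child}: {', '.join(parents)}")
```

Under this reading, most child tables reference both students and teachers, so the schema is a graph rather than a tree.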
4. First Approach Tested: SDV Multitable – HMA Synthesizer
SDV offers HMASynthesizer for multi-table generation.
HMA (Hierarchical Modeling Algorithm) is a statistical, non-GAN method.
Why we tested HMA
It supports relational structures
Ensures referential integrity automatically
Easy implementation for 1 parent → 1 child scenarios
We tested it on the simplest pair:
Tables tested
students (parent)
fees (child, references student_id)
This is the minimum relational test case.
5. SDV HMA Experimental Output & Results
A. Synthetic table generation
HMA successfully produced synthetic versions of:
synthetic_students
synthetic_fees
B. Referential integrity check
HMA automatically enforces FK relationships.
Result:
FK Violations: 0
This means:
Every row in fees.student_id correctly referenced an ID in students.student_id.
Structural integrity was preserved.
C. Cardinality Distribution Comparison
We compared fees-per-student distribution in real vs synthetic.
Fees-per-student stats (real vs synthetic)
Statistic | Real | Synthetic |
count | 991 | 988 |
mean | 5.045 | 5.060 |
std | 2.164 | 2.209 |
min | 1 | 1 |
25% | 3 | 4 |
50% | 5 | 5 |
75% | 6 | 7 |
max | 15 | 11 |
Interpretation
The synthetic distribution closely matches the real one in mean (5.045 vs 5.060) and standard deviation (2.164 vs 2.209)
Quartile shifts (3→4, 6→7) are mild
Maximum child count is reduced (15→11), a common issue in statistical models
Overall, the HMA output shows good statistical alignment for this simple scenario
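Summary statistics like the fees-per-student figures can be derived from the child table alone by counting rows per foreign key value. The following is a minimal sketch using only the standard library; it is not the code used for the experiment, and the function names and toy rows are hypothetical. Students with no fee rows do not appear in the counts.

```python
from collections import Counter
from statistics import mean, quantiles, stdev


def children_per_parent(child_rows, fk_column):
    """Number of child rows per distinct foreign key value, sorted ascending."""
    return sorted(Counter(row[fk_column] for row in child_rows).values())


def summarize(counts):
    """Count, mean, sample std, min, quartiles, and max of the child counts."""
    q1, q2, q3 = quantiles(counts, n=4, method="inclusive")
    return {
        "count": len(counts), "mean": mean(counts), "std": stdev(counts),
        "min": min(counts), "25%": q1, "50%": q2, "75%": q3, "max": max(counts),
    }


# Toy rows, invented for illustration only: s1 has 3 fees, s2 has 1, s3 has 2.
fees = [{"student_id": s} for s in ["s1", "s1", "s1", "s2", "s3", "s3"]]
summary = summarize(children_per_parent(fees, "student_id"))
print(summary)
```

Running the same two functions on the real and the synthetic fees table gives two summaries that can be compared row by row.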
6. Relational Score Using Non-GAN Approach (HMA)
Metric | Result |
Foreign Key Integrity | 100% (0 violations) |
Cardinality Preservation | ~94% similarity |
Distribution Similarity (mean/std) | High match |
Relational Realism Score | High for 1→N relations |
Based on both the FK checks and the cardinality alignment, HMA successfully handled this simple hierarchical relationship.
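The article does not state how the cardinality similarity figure was computed. One common way to score two child-count distributions is the complement of the Kolmogorov-Smirnov statistic, sketched below under that assumption; the function name `ks_complement` and the toy counts are hypothetical, and this should not be read as the metric behind the reported result.

```python
from collections import Counter


def ks_complement(real_counts, synthetic_counts):
    """Return 1 minus the largest gap between the two empirical CDFs.

    A score of 1.0 means the two distributions of children per parent are
    identical; lower scores mean a larger mismatch.
    """
    values = sorted(set(real_counts) | set(synthetic_counts))
    real_hist, synth_hist = Counter(real_counts), Counter(synthetic_counts)
    real_cdf = synth_cdf = 0.0
    max_gap = 0.0
    for value in values:
        real_cdf += real_hist[value] / len(real_counts)
        synth_cdf += synth_hist[value] / len(synthetic_counts)
        max_gap = max(max_gap, abs(real_cdf - synth_cdf))
    return 1.0 - max_gap


# Toy child counts, invented for illustration only.
score = ks_complement([1, 2, 2, 3], [1, 2, 3, 3])
print(f"cardinality similarity: {score:.2f}")
```

Because the score depends only on the counts per parent, it measures cardinality preservation separately from column-level realism.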
7. Limitations of HMA for Our Full Schema
While HMA worked for two tables, it fundamentally cannot scale to the full complexity of our school management database.
Below are the key limitations.
1. No Support for Many-to-Many Tables
HMA requires a tree-shaped relational structure.
Tables like:
student_courses (student_id, course_id)
course_instructors (course_id, teacher_id)
represent graphs, not trees.
HMA cannot model a child table with two parents.
2. No Support for GAN Training
HMA is purely statistical.
This means:
No ability to learn high-dimensional correlations
Poor performance on complex interactions
3. Cannot Learn Multi-Table Patterns
Relationships like:
“Students with tougher courses tend to score lower marks”
“Students in certain batches pay fees differently”
“Teachers influencing student performance across multiple tables”
cannot be learned by statistical hierarchies.
4. Fails on Synthetic Transactional or Behavioral Data
Tables like:
canteen_transactions
events_participation
attendance_logs
contain high-frequency behavioral data.
These require GAN-based sequence modeling or temporal modeling, which HMA simply cannot handle.
5. Struggles With UUID-Based Identifiers
Our database uses UUIDs for:
student_id
fee_id
course_id
etc.
UUIDs have extremely high cardinality, and statistical models cannot learn their structure.
This results in:
Reused IDs
Incorrect string formats
Potential FK mismatches
We had to manually enforce regex-based UUID generation to fix this.
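The regex-based fix itself is not shown here. As an alternative illustration of the same goal, the sketch below replaces placeholder keys with freshly generated UUIDs after sampling and rewrites the child foreign keys through the same mapping, so links are preserved. It assumes the placeholder parent keys are unique and that tables are lists of dictionaries; `remap_ids` is a hypothetical helper, not the approach used in the experiment.

```python
import uuid


def remap_ids(parent_rows, pk_column, child_rows, fk_column):
    """Swap placeholder parent keys for UUIDs and update child references.

    Assumes parent keys are unique. Raises KeyError if a child row
    references a key that is not present in the parent table.
    """
    mapping = {row[pk_column]: str(uuid.uuid4()) for row in parent_rows}
    new_parents = [{**row, pk_column: mapping[row[pk_column]]}
                   for row in parent_rows]
    new_children = [{**row, fk_column: mapping[row[fk_column]]}
                    for row in child_rows]
    return new_parents, new_children


# Toy rows with integer placeholder keys, invented for illustration only.
students = [{"student_id": 1}, {"student_id": 2}]
fees = [
    {"fee_id": 10, "student_id": 1},
    {"fee_id": 11, "student_id": 1},
    {"fee_id": 12, "student_id": 2},
]
new_students, new_fees = remap_ids(students, "student_id", fees, "student_id")
```

Generating identifiers outside the model avoids asking a statistical model to learn the format of a high-cardinality string column.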
6. Cannot Handle Deep Graph-Shaped Schemas
Our dependency chains are not simple:
students → student_courses → marks
students → fees
courses → course_instructors → teachers
students → events_participation → event_details
HMA cannot:
Propagate relationships through multiple levels
Learn cross-table correlations
Handle graph-centric relational patterns
It is limited to shallow, tree-like schemas only.
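The depth of these chains can be measured directly. The sketch below takes a subset of the schema, with parents inferred from foreign key column names (an assumption), and uses the standard library's `graphlib.TopologicalSorter` to find an order in which every parent table comes before its children, plus the length of the longest chain leading to each table. The `PARENTS` constant is a hypothetical name.

```python
from graphlib import TopologicalSorter

# table -> set of parent tables (inferred from column names; an assumption).
PARENTS = {
    "students": {"teachers"},
    "courses": {"teachers"},
    "student_courses": {"students", "courses"},
    "marks": {"student_courses", "teachers"},
    "fees": {"students", "teachers"},
    "course_instructors": {"courses", "teachers"},
}

# static_order() yields every parent before any table that depends on it.
order = list(TopologicalSorter(PARENTS).static_order())

# Depth = length of the longest parent chain above a table (root tables are 0).
depth = {}
for table in order:
    parents = PARENTS.get(table, set())
    depth[table] = 1 + max(depth[p] for p in parents) if parents else 0

for table in order:
    print(f"{table}: depth {depth[table]}")
```

Any generator that conditions child tables on their parents has to produce tables in such an order, and a table like marks sits three levels below the root.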
7. Cardinality Drift
HMA tends to:
Under-estimate maximum values
Smooth out spikes
Lose long-tail behavior
This leads to synthetic datasets that look “average” but lose realistic extremes.
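Long-tail loss is easy to miss when only means and quartiles are compared. A minimal sketch of a tail check, assuming child counts per parent have already been computed for both datasets; the function name, the threshold, and the toy counts are hypothetical.

```python
def tail_share(counts, threshold):
    """Fraction of parents that have more than `threshold` child rows."""
    return sum(1 for c in counts if c > threshold) / len(counts)


# Toy child counts, invented for illustration only.
real_counts = [1, 2, 3, 5, 12, 15]
synthetic_counts = [2, 3, 3, 4, 5, 6]

real_tail = tail_share(real_counts, threshold=10)
synthetic_tail = tail_share(synthetic_counts, threshold=10)
print(f"real: {real_tail:.2f}, synthetic: {synthetic_tail:.2f}")
```

A synthetic tail share that collapses toward zero while the real one does not is a sign that the extremes were smoothed away.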
8. Conclusion
Our initial experiments show:
What HMA can do well
Works for simple 1→N relationships
Perfect foreign key integrity
Good basic distribution alignment
Fast and simple to use
Where HMA fails
Many-to-many tables
Multi-parent relationships
Deep dependency structures
UUID-heavy schemas
High-dimensional correlations
Behavioral or transactional datasets
Any graph-shaped schema
Given all these limitations, HMA cannot be used for our full school-management database.
To generate realistic, structurally correct synthetic data for the entire relational system, a more advanced approach is required:
A multi-table GAN-based pipeline that models each table individually, conditions child tables on parent embeddings, and reconstructs relational integrity after generation.
This approach enables:
High realism
Support for many-to-many tables
Deep relational consistency
True cross-table correlation learning
Correct UUID formatting
Full graph-level reconstruction
This method is significantly more powerful than HMA and is suitable for real-world relational databases like ours.
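The last step of such a pipeline, reconstructing relational integrity after generation, can be illustrated in a much simplified form. The sketch below assumes child rows were generated without a foreign key and attaches each one to a synthetic parent, drawing the number of children per parent from the child counts observed in the real data. It ignores conditioning on parent embeddings entirely, and `assign_foreign_keys` is a hypothetical helper rather than the pipeline's actual implementation.

```python
import random


def assign_foreign_keys(parent_ids, child_rows, real_child_counts,
                        fk_column, seed=0):
    """Attach every child row to a valid parent id.

    Each parent gets a number of slots sampled from the real child counts;
    the shuffled slots are then handed out to child rows one by one.
    """
    rng = random.Random(seed)
    slots = []
    for parent_id in parent_ids:
        slots.extend([parent_id] * rng.choice(real_child_counts))
    rng.shuffle(slots)
    # If there are more child rows than slots, pad with random parents.
    while len(slots) < len(child_rows):
        slots.append(rng.choice(parent_ids))
    # zip() stops at the shorter input, so surplus slots are simply unused.
    return [{**row, fk_column: parent_id}
            for row, parent_id in zip(child_rows, slots)]


# Toy inputs, invented for illustration only.
parent_ids = ["p1", "p2", "p3"]
child_rows = [{"amount_due": 100.0 * i} for i in range(8)]
linked = assign_foreign_keys(parent_ids, child_rows, [1, 2, 3, 5], "student_id")
```

Because every assigned key is drawn from the list of synthetic parent ids, the result has no orphan rows by construction; how closely the cardinality matches the real data depends on the padding and truncation at the end.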



