
Challenges in Relational Multi-Table Synthetic Data Generation



1. Introduction


Synthetic data generation is increasingly important when working with sensitive or regulated datasets. While generating synthetic data for single tables is straightforward using GANs or statistical models, generating relational multi-table synthetic data is significantly more complex.

Relational databases do not exist in isolation. They contain relationships that define how information flows across the system:

  • Foreign keys (parent → child)

  • Many-to-one (Fees → Student: many fee rows belong to one student)

  • One-to-many (Teacher → Courses)

  • Many-to-many join tables (Student ↔ Courses)

  • Deep multi-level dependencies (Student → StudentCourses → Marks → Rank)

Every table influences others, and these dependencies must be preserved in the synthetic version. This combination of relational structure and statistical realism makes multi-table synthetic generation one of the tougher challenges in applied data science.


2. Problems With Relational Multi-Table Synthetic Data


Relational synthetic data must simultaneously maintain two distinct but equally critical properties.


A. Structural Correctness


Every foreign key must point to a valid parent record:

  • No orphan rows

  • No mismatched or invalid IDs

  • Correct table row counts and link consistency
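The "no orphan rows" requirement can be checked mechanically. The following is a minimal sketch, assuming tables are held as lists of dictionaries; the function name `find_orphans` and the toy rows are illustrative and not part of any library.

```python
def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]


# Toy data (hypothetical): one fee row points at a student that does not exist.
students = [{"student_id": "s1"}, {"student_id": "s2"}]
fees = [
    {"fee_id": "f1", "student_id": "s1"},
    {"fee_id": "f2", "student_id": "s2"},
    {"fee_id": "f3", "student_id": "s9"},  # orphan
]

orphans = find_orphans(fees, "student_id", students, "student_id")
print(f"FK violations: {len(orphans)}")
```

Running the same check on every parent/child pair in a schema gives a simple structural pass/fail signal before any statistical comparison is attempted.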


B. Statistical Realism


Beyond structural correctness, the synthetic dataset must behave like the real dataset:

  • Distributions of numeric values (mean, variance, skewness)

  • Categorical patterns and frequencies

  • Joint relationships between columns

  • Cross-table correlations

  • Behavioral patterns (e.g., students with more courses tend to pay more fees)

  • Cardinality patterns (fees per student, courses per teacher)
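A cross-table pattern such as "students with more courses tend to pay more fees" can be turned into a number by aggregating each child table per student and correlating the aggregates. This is a sketch under assumptions: the column names follow the schema in this post, while `pearson`, `courses_vs_fees`, and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def courses_vs_fees(student_courses, fees):
    """Correlate courses per student with total amount due per student."""
    course_counts = Counter(row["student_id"] for row in student_courses)
    fee_totals = defaultdict(float)
    for row in fees:
        fee_totals[row["student_id"]] += row["amount_due"]
    shared = sorted(set(course_counts) & set(fee_totals))
    return pearson([course_counts[s] for s in shared],
                   [fee_totals[s] for s in shared])


# Toy data (hypothetical): fees grow linearly with the number of courses.
student_courses = [{"student_id": s} for s in ["s1", "s2", "s2", "s3", "s3", "s3"]]
fees = [
    {"student_id": "s1", "amount_due": 100.0},
    {"student_id": "s2", "amount_due": 200.0},
    {"student_id": "s3", "amount_due": 150.0},
    {"student_id": "s3", "amount_due": 150.0},
]
r = courses_vs_fees(student_courses, fees)
print(f"correlation: {r:.3f}")
```

Computing the same statistic on the real and synthetic databases and comparing the two values shows whether the generator kept the cross-table relationship.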



Most simple generators fail because they satisfy either structure or realism, but not both. GANs excel at statistical realism but know nothing about foreign key rules. Statistical models excel at structural constraints but miss deep correlations.

Relational synthetic data requires both sides to work together.



3. Our Use Case: A School Management Database


To illustrate the complexity of relational multi-table synthetic data generation, let’s look at a real example: a school management system. This system tracks students, teachers, courses, financial activity, academic performance, and various administrative operations. Before attempting synthetic generation, it’s important to understand the structure of the data and how the tables depend on each other.

Below is a high-level overview of the main tables and their columns.


Students

student_id; enrollment_number; first_name; last_name; date_of_birth; gender; address_line1; address_line2; city; state; postal_code; phone; email; guardian_name; guardian_contact; enrollment_date; status; homeroom_teacher_id


Teachers

teacher_id; employee_code; first_name; last_name; qualification; experience_years; department; phone; email; hire_date; status


Courses

course_id; course_code; name; description; credits; level; department; is_active; lead_teacher_id


Student Courses

student_course_id; student_id; course_id; academic_year; term; enrollment_status; grade_letter; grade_points


Marks

mark_id; student_course_id; assessed_by_teacher_id; assessment_type; max_score; score_obtained; weightage_percent; assessment_date


Fees

fee_id; student_id; fee_type; amount_due; amount_paid; due_date; payment_date; payment_mode; status; approved_by_teacher_id


Games Participation

games_participation_id; student_id; game_name; team_name; level; position_played; achievement; season_year; coach_teacher_id; manager_teacher_id


Canteen Transactions

transaction_id; student_id; transaction_date; item_description; quantity; amount; payment_method; authorized_by_teacher_id


Transport Assignments

transport_id; student_id; route_name; pickup_point; dropoff_point; vehicle_number; driver_name; driver_contact; valid_from; valid_to; monthly_fee; route_incharge_teacher_id


Course Instructors

course_instructor_id; course_id; teacher_id; academic_year; term; role



This schema alone shows why relational synthetic data is challenging:

  • Multiple one-to-many relationships

  • Several many-to-many join tables

  • Deep dependency chains (e.g., students → student_courses → marks)

  • Multiple foreign keys pointing to teachers

  • Highly diverse data types (UUIDs, dates, numbers, categorical values)

  • Behavioral/transactional tables (canteen, transport, fees)

This is the dataset we use to evaluate relational synthetic data techniques, and it clearly goes beyond what simple or single-table generative models can handle.


4. First Approach Tested: SDV Multitable – HMA Synthesizer


SDV (Synthetic Data Vault) offers HMASynthesizer for multi-table generation.

HMA (Hierarchical Modeling Algorithm) is a statistical, non-GAN method.


Why we tested HMA

  • Supports relational structures

  • Enforces referential integrity automatically

  • Is easy to set up for one parent table and one child table

We tested it on the simplest pair:

Tables tested

  • students (parent)

  • fees (child, references student_id)

This is the minimum relational test case.


5. SDV HMA Experimental Output & Results


A. Synthetic table generation

HMA successfully produced synthetic versions of:

  • synthetic_students

  • synthetic_fees


B. Referential integrity check

HMA automatically enforces FK relationships.


Result:

FK Violations: 0 


This means:

  • Every value in fees.student_id references an existing students.student_id.

  • Structural integrity was preserved.


C. Cardinality Distribution Comparison


We compared the fees-per-student distribution in the real and synthetic data.


Real fees-per-student stats


count   991
mean      5.045
std       2.164
min       1
25%       3
50%       5
75%       6
max      15


Synthetic fees-per-student stats


count   988
mean      5.060
std       2.209
min       1
25%       4
50%       5
75%       7
max      11
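Summaries like the two above are statistics over the number of child rows per parent key. The sketch below shows one way to compute them with the standard library. It assumes a sample standard deviation and linearly interpolated quartiles, which may differ from the tool used for the figures in this post; `children_per_parent`, `summarize`, and the toy rows are hypothetical.

```python
from collections import Counter
import statistics


def children_per_parent(child_rows, fk_column):
    """Count child rows per parent key; parents with no children do not appear."""
    return sorted(Counter(row[fk_column] for row in child_rows).values())


def summarize(counts):
    """Summary of a cardinality distribution (sample std, inclusive quartiles)."""
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "count": len(counts),
        "mean": statistics.mean(counts),
        "std": statistics.stdev(counts),
        "min": min(counts),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(counts),
    }


# Toy data (hypothetical): four students with 1, 2, 3 and 2 fee rows.
fees = [{"student_id": s} for s in ["s1", "s2", "s2", "s3", "s3", "s3", "s4", "s4"]]
summary = summarize(children_per_parent(fees, "student_id"))
for name, value in summary.items():
    print(f"{name:<6}{value:.3f}")
```

Because the counts come from the child table alone, the "count" row is the number of students that have at least one fee record, not the size of the students table.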


Interpretation


  • The synthetic distribution closely matches the real one in mean (5.045 vs 5.060) and standard deviation (2.164 vs 2.209)

  • The quartile shifts are mild: the 25th and 75th percentiles each move up by one (3→4, 6→7)

  • The maximum child count is reduced (15→11), so the long tail is truncated, a common issue in statistical models

  • Overall, the HMA output shows good statistical alignment for this simple scenario



6. Relational Score Using Non-GAN Approach (HMA)


Metric                                Result

Foreign Key Integrity                 100% (0 violations)

Cardinality Preservation              ~94% similarity

Distribution Similarity (mean/std)    High match

Relational Realism Score              High for 1→N relations

Based on both the FK checks and the cardinality alignment, HMA successfully handled this simple hierarchical relationship.
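A cardinality similarity percentage needs a concrete definition. One common choice is the complement of the two-sample Kolmogorov-Smirnov statistic computed over the child counts per parent. The sketch below implements that definition. It is an assumption for illustration only and not necessarily how the ~94% figure here was obtained; `ks_complement` and the toy counts are hypothetical.

```python
from bisect import bisect_right


def ks_complement(real_counts, synthetic_counts):
    """Return 1 - KS statistic between two samples (1.0 means identical CDFs)."""
    real = sorted(real_counts)
    synth = sorted(synthetic_counts)
    ks = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for value in set(real) | set(synth):
        cdf_real = bisect_right(real, value) / len(real)
        cdf_synth = bisect_right(synth, value) / len(synth)
        ks = max(ks, abs(cdf_real - cdf_synth))
    return 1.0 - ks


# Toy cardinalities (hypothetical): children per parent in each dataset.
identical = ks_complement([1, 2, 2, 3], [1, 2, 2, 3])
shifted = ks_complement([1, 1, 2, 2], [2, 2, 3, 3])
print(identical, shifted)
```

A score of 1.0 means the two cardinality distributions have the same empirical CDF; lower values mean the synthetic data assigns children to parents with a different shape.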



7. Limitations of HMA for Our Full Schema


While HMA worked for two tables, it fundamentally cannot scale to the full complexity of our school management database.

Below are the key limitations.


1. No Support for Many-to-Many Tables


HMA requires a tree-shaped relational structure.

Tables like:

  • student_courses (student_id, course_id)

  • course_instructors (course_id, teacher_id)

represent graphs, not trees.

HMA cannot model a child table with two parents.
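Whether a schema is tree-shaped in this sense can be checked from its foreign keys alone: any table that references more than one distinct parent table breaks the tree. The sketch below runs that check on a subset of the school schema. The snake_case table names and the `FOREIGN_KEYS` list are assumptions derived from the column lists in this post, and the code inspects only the schema shape, not HMA itself.

```python
from collections import defaultdict

# (child_table, parent_table) pairs, assumed from the schema overview above.
FOREIGN_KEYS = [
    ("students", "teachers"),            # homeroom_teacher_id
    ("courses", "teachers"),             # lead_teacher_id
    ("student_courses", "students"),
    ("student_courses", "courses"),
    ("course_instructors", "courses"),
    ("course_instructors", "teachers"),
    ("marks", "student_courses"),
    ("marks", "teachers"),               # assessed_by_teacher_id
    ("fees", "students"),
    ("fees", "teachers"),                # approved_by_teacher_id
]


def multi_parent_tables(foreign_keys):
    """Map each table with more than one distinct parent to its set of parents."""
    parents = defaultdict(set)
    for child, parent in foreign_keys:
        parents[child].add(parent)
    return {child: ps for child, ps in parents.items() if len(ps) > 1}


offenders = multi_parent_tables(FOREIGN_KEYS)
for table in sorted(offenders):
    print(table, "->", sorted(offenders[table]))
```

In this subset, four tables (course_instructors, fees, marks, student_courses) have two parents each, which is what makes the schema a graph rather than a tree.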


2. No Support for GAN Training


HMA is purely statistical.

This means:

  • No ability to learn high-dimensional correlations

  • Poor performance on complex interactions


3. Cannot Learn Multi-Table Patterns


Relationships like:

  • “Students with tougher courses tend to score lower marks”

  • “Students in certain batches pay fees differently”

  • “Teachers influence student performance across multiple tables”

cannot be learned by statistical hierarchies.


4. Fails on Synthetic Transactional or Behavioral Data


Tables like:

  • canteen_transactions

  • events_participation

  • attendance_logs

contain high-frequency behavioral data.

These require GAN-based sequence modeling or temporal modeling, which HMA simply cannot handle.


5. Struggles With UUID-Based Identifiers


Our database uses UUIDs for:

  • student_id

  • fee_id

  • course_id

  • etc.


UUIDs have extremely high cardinality, and statistical models cannot learn their structure.

This results in:

  • Reused IDs

  • Incorrect string formats

  • Potential FK mismatches

We had to manually enforce regex-based UUID generation to fix this.
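As an illustration of the kind of repair involved, the sketch below replaces synthetic keys with fresh version 4 UUIDs after generation and applies the same mapping to the child table, then validates the format with a regular expression. This is a post-processing alternative shown under assumptions, not the exact fix used here: it assumes the synthetic parent keys are already unique, and `remap_keys` and the toy rows are hypothetical.

```python
import re
import uuid

# Canonical lowercase UUID version 4 layout.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)


def remap_keys(parent_rows, child_rows, key):
    """Swap every parent key for a fresh UUID and rewrite child references to match.

    Raises KeyError if a child references a key missing from the parent table.
    """
    mapping = {row[key]: str(uuid.uuid4()) for row in parent_rows}
    new_parents = [{**row, key: mapping[row[key]]} for row in parent_rows]
    new_children = [{**row, key: mapping[row[key]]} for row in child_rows]
    return new_parents, new_children


# Toy data (hypothetical): malformed identifiers produced by a generator.
students = [{"student_id": "sdv-id-0"}, {"student_id": "sdv-id-1"}]
fees = [
    {"fee_id": "a", "student_id": "sdv-id-0"},
    {"fee_id": "b", "student_id": "sdv-id-1"},
    {"fee_id": "c", "student_id": "sdv-id-1"},
]

new_students, new_fees = remap_keys(students, fees, "student_id")
valid = all(UUID_PATTERN.match(row["student_id"]) for row in new_students)
print("all UUIDs valid:", valid)
```

Because the child table is rewritten through the same mapping, the foreign key links that existed before the remap are preserved exactly.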


6. Cannot Handle Deep Graph-Shaped Schemas


Our dependency chains are not simple:

students → student_courses → marks  

students → fees  

courses → course_instructors → teachers  

students → events_participation → event_details

HMA cannot:

  • Propagate relationships through multiple levels

  • Learn cross-table correlations

  • Handle graph-centric relational patterns

It is limited to shallow, tree-like schemas only.


7. Cardinality Drift

HMA tends to:

  • Under-estimate maximum values (in our test, the maximum fees per student fell from 15 to 11)

  • Smooth out spikes

  • Lose long-tail behavior

This leads to synthetic datasets that look “average” but lose realistic extremes.


8. Conclusion


Our initial experiments show:


What HMA can do well


  • Works for simple 1→N relationships

  • Perfect foreign key integrity

  • Good basic distribution alignment

  • Fast and simple to use


Where HMA fails


  • Many-to-many tables

  • Multi-parent relationships

  • Deep dependency structures

  • UUID-heavy schemas

  • High-dimensional correlations

  • Behavioral or transactional datasets

  • Any graph-shaped schema


Given all these limitations, HMA cannot be used for our full school-management database.

To generate realistic, structurally correct synthetic data for the entire relational system, a more advanced approach is required:


 A multi-table GAN-based pipeline that models each table individually, conditions child tables on parent embeddings, and reconstructs relational integrity after generation.


This approach enables:

  • High realism

  • Support for many-to-many tables

  • Deep relational consistency

  • True cross-table correlation learning

  • Correct UUID formatting

  • Full graph-level reconstruction


We expect this method to be significantly more powerful than HMA and better suited to real-world relational databases like ours.
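Two structural pieces of such a pipeline can be sketched without any model: choosing an order in which parents are generated before their children, and attaching generated children to generated parents. The code below is a sketch under assumptions. It uses a reduced foreign key list, and it assigns parent keys uniformly at random, which restores integrity but ignores the cardinality and correlation patterns a conditioned generator would need to learn. `generation_order` and `assign_foreign_keys` are hypothetical names, and the GAN component is not shown.

```python
from graphlib import TopologicalSorter  # Python 3.9+
import random


def generation_order(foreign_keys):
    """Order tables so every parent comes before its children.

    Raises graphlib.CycleError if the foreign keys form a cycle.
    """
    graph = {}
    for child, parent in foreign_keys:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set())
    return list(TopologicalSorter(graph).static_order())


def assign_foreign_keys(child_rows, fk_column, parent_keys, rng):
    """Give each child row a key drawn from the generated parent keys."""
    return [{**row, fk_column: rng.choice(parent_keys)} for row in child_rows]


# (child_table, parent_table) pairs, a reduced and assumed subset of the schema.
foreign_keys = [
    ("student_courses", "students"),
    ("student_courses", "courses"),
    ("marks", "student_courses"),
    ("fees", "students"),
]
order = generation_order(foreign_keys)
print(order)

rng = random.Random(0)  # fixed seed for repeatable output
generated_fees = [{"fee_id": f"f{i}"} for i in range(5)]
linked_fees = assign_foreign_keys(generated_fees, "student_id", ["s1", "s2", "s3"], rng)
```

Many-to-many tables such as student_courses fit the same scheme: they are generated after both parents and receive one key from each.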



 
 
 
