Organic or Synthetic? Both are good for data!
8th Edition

Ever since the first generation of capable LLMs came out, there have been theories that we would ultimately run out of internet data to train on.
“Well,” the optimists said, “we’ll just use the LLMs to train themselves!”
“That would never work!” yelled the pessimists.
Whatever side of that debate you are on, models have lately kept improving in quality and reasoning. Smaller models trained on smaller datasets are now outperforming the behemoths of the last few generations.
This is a frontier research problem, and today we want to discuss an important, perhaps the most important, component of it: synthetic data. But first, a few updates.
Upcoming Summer School Announcements
Deadlines for two of these summer schools close within twelve days, so make sure to apply on time!
| Title | Deadline | Dates |
| --- | --- | --- |
| Latin American School of Artificial Intelligence (LASAI) 2025 - Lima, Peru | Sep 30 | Oct 27 - Oct 31, 2025 |
| Winter School on Causality and Explainable AI 2025 - Paris, France | Oct 01 | Oct 20 - Oct 24, 2025 |
| | Oct 16 | Dec 29, 2025 - Jan 8, 2026 |
| Northern Lights Deep Learning Winter School 2026 - Tromso, Norway | Nov 15 | Jan 5 - Jan 9, 2026 |
To see the full list, visit Awesome MLSS and sign up to stay updated!
What’s happening in AI?
For most of us, this might not be fresh in memory, but do you remember Phi-1’s release in 2023? The world was still reeling from the possibilities opened up by ChatGPT, Google was scrambling to release its first AI-based product, and Microsoft quietly slipped in a research paper that went viral.
Phi-1 was not a frontier model by any measure, but it had a curious data selection strategy. The paper, titled “Textbooks Are All You Need”, showed that by using merely 6B tokens of textbook-quality web data and 1B tokens of textbooks synthetically generated with GPT-3.5, you could get surprisingly performant models. While not the best, it definitely outdid many other models of similar size with a fraction of the training data.
Why Bother With Synthetic Data?
There is ample evidence that data selection and data quality enhance reasoning in LLMs.
In 2023, two researchers from Meta FAIR and MBZUAI began exploring how LLMs store information. In “Physics of Language Models: Part 3.1, Knowledge Storage and Extraction”, they trained smaller GPT-2 style models on custom data to probe them. Their findings reveal that if LLMs are trained purely on prose-form facts, they memorize the information but cannot retrieve it. For example, after pretraining on thousands of biographies, the model could not answer questions about birth dates unless it had seen at least a few question-answer examples; once exposed to such examples, it generalized well.
This matters because it shows that LLM capabilities come not just from architecture, but also from how the training data is constructed.
Similar evidence was found in many other papers:
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (Apple, 2023) showed that reformatting web data sped up pretraining by almost 3×.
As mentioned in our last newsletter, DeepSeek-R1 was largely trained on synthetic data generated by prompting DeepSeek-R1-Zero, yet achieved strong reasoning.
Takeaway: Data selection and structure directly affect reasoning ability, not just raw scale.
Synthesis of Synthetic Data
In the remaining sections, we will discuss the key overarching phases of how synthetic datasets are built.
Before that, two training terms need clarification:
Pre-training: Classic LLM training where the model learns to complete passages, capturing basic token causality.
Post-training: Instruction-following or reasoning training, where reasoning chains may be multi-step or single-pass depending on the architecture.
Filtering
Given a large amount of raw, unfiltered data, it is obviously desirable to eliminate the low-information items first. There are several papers that discuss this, but we will be focusing on the methods used by RefinedWeb, Nemotron-CC, and DCLM. All three began with Common Crawl, a massive dump of internet articles, and then filtered it to find the best possible training content.
Key filtering steps include:
URL Filtering: Base URLs are automatically filtered to exclude undesirable domains.
Text Extraction: Since Common Crawl provides HTML rather than plain text, it must be converted. Existing HTML-to-plaintext extractors are largely heuristic, which means there is a tradeoff in extraction quality; each team experimented with multiple frameworks and settled on the one that worked best in their tests.
Language Identification: Essential for restricting to a single language or partitioning multilingual datasets.
Repetition Filtering and Deduplication: Web articles are often syndicated or near-duplicated. Exact string matching is impractical at scale, so approximate methods like MinHash or Bloom filters are preferred, though fuzzy matching or exact string matching are sometimes used (see the MinHash sketch after this list).
Quality Filtering: Learned classifiers help disqualify low-quality or spam-like content.
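
To make the deduplication step concrete, here is a minimal, self-contained sketch of MinHash-based near-duplicate detection. The shingle size, number of hash functions, and similarity threshold are illustrative choices, not the settings used by RefinedWeb, Nemotron-CC, or DCLM, and a production pipeline would bucket signatures with locality-sensitive hashing rather than compare every pair of documents.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 64     # signature length; illustrative, not a value from the papers
SHINGLE_SIZE = 5    # word n-gram size; illustrative

def shingles(text: str, n: int = SHINGLE_SIZE) -> Set[str]:
    """Break a document into overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text: str) -> List[int]:
    """Keep the minimum hash per seeded hash function; this approximates the shingle set."""
    doc_shingles = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in doc_shingles)
        for seed in range(NUM_HASHES)
    ]

def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not too similar to anything already kept."""
    kept_docs, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept_docs.append(doc)
            kept_sigs.append(sig)
    return kept_docs
```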
Now, when it comes to filtering, there are tradeoffs - filtering too aggressively shrinks the dataset, while filtering too loosely lets low-quality content through.
Fortunately, there seems to be a clear answer - repeating data from a smaller, higher-quality dataset yields better performance than a single pass over a larger, lower-quality one. A paper from 2025 ran ablation studies showing that repeating a subset of high-quality documents provides better results. DatalogyAI also reported better results from models trained on a highly filtered and curated subset of raw data as opposed to the whole dataset. In fact, their curated dataset of 60B tokens outperformed the 600B-token superset.
This does come with an important caveat, however - strictly limiting the dataset can shrink the model’s knowledge base. As noted above, improvements in performance depend on a mix of knowledge and reasoning abilities, so it is important to use both in good measure.
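
As a toy illustration of what “quality + repetition” can mean in a data pipeline, the sketch below fills a fixed token budget by making repeated passes over a small curated corpus instead of padding it out with lower-quality raw data. The token budget, epoch cap, and token counter are placeholders we made up, not settings from the papers above.

```python
from typing import Callable, Iterator, List

def fill_token_budget(
    curated_docs: List[str],
    token_budget: int,
    count_tokens: Callable[[str], int],
    max_epochs: int = 4,  # placeholder cap on how often the subset is repeated
) -> Iterator[str]:
    """Yield curated documents repeatedly (up to max_epochs passes)
    until the token budget is spent."""
    spent = 0
    for _ in range(max_epochs):
        for doc in curated_docs:
            if spent >= token_budget:
                return
            yield doc
            spent += count_tokens(doc)

# Example usage with a crude whitespace token count:
# stream = fill_token_budget(curated_docs, token_budget=10_000_000,
#                            count_tokens=lambda d: len(d.split()))
```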
Takeaway: Quality + repetition > scale, though overly strict curation risks shrinking the model’s knowledge base.
Reselection and Shuffling
What do you want your model to excel at — arithmetic, coding, knowledge Q&A, or balanced conversation? These choices shape the pipeline.
Reselection and shuffling go beyond randomness: they decide how much weight different skills receive. Oversample coding → sharper logic but weaker dialogue. Underrepresent math → weak precision. The goal is usually a blend.
As an example, in the latest iteration of the Phi series, Phi-4, the researchers prioritised a strong mix of web data, code, and question-answer datasets rephrased from source material. This included targeted acquisition of data from sources like arXiv, GitHub, PubMed and others, with a mix of custom extraction and cleaning pipelines. According to the authors, this gives the model access to more detail and nuance while keeping the formatting consistent with the expected outputs.
Reselection and shuffling can take place both after filtering and after synthetic data generation. The choice depends on the design and the compute capacity available to the team - these methods can get very expensive if applied to a large dataset that turns out to be the wrong one.
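
Here is a minimal sketch of what domain reweighting during reselection might look like, assuming documents have already been tagged by source. The mixture weights below are invented for illustration; they are not the ratios used by Phi-4 or any of the other models discussed here.

```python
import random
from typing import Dict, List

# Hypothetical target mixture; real teams tune these weights with ablation studies.
MIXTURE = {"web": 0.45, "code": 0.25, "math": 0.15, "qa": 0.15}

def sample_mixture(
    buckets: Dict[str, List[str]],
    num_docs: int,
    weights: Dict[str, float] = MIXTURE,
    seed: int = 0,
) -> List[str]:
    """Draw a training order whose domain proportions follow the target mixture.

    buckets maps a domain name to the list of documents tagged with that domain.
    Sampling is with replacement, so small high-priority domains get repeated.
    """
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    out = []
    for _ in range(num_docs):
        domain = rng.choices(domains, weights=probs, k=1)[0]
        out.append(rng.choice(buckets[domain]))
    return out
```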
Takeaway: Reselection determines the implicit priorities encoded in a model’s training.
Synthetic Data Generation
There are two main ways to generate synthetic data.
Generator-driven approach: No external datasets are used; the model is simply prompted to produce content, which then forms the dataset. This approach was used by Cosmopedia and TinyStories. Evidence shows it can yield performance gains, but only up to a point: improvements decline after a threshold, and there is a potential risk of irrecoverable model collapse (see [paper]).
Source rephrasing approach: High-quality source material is restructured in multiple formats to enhance reasoning. For example, the same Wikipedia article could be rephrased as an academic essay, or as a dialogue between two friends. This approach, used by Apple’s Rephrasing the Web and Nemotron-CC, increases data diversity and improves model performance.
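
To make the source-rephrasing approach concrete, here is a prompt-level sketch. The `call_llm` function is a hypothetical stand-in for whatever generation backend you use, and the two styles are just examples of the many formats explored by Rephrasing the Web and Nemotron-CC.

```python
from typing import Dict

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real generation backend (an API client, vLLM, etc.)."""
    raise NotImplementedError("plug in your own LLM call here")

# Illustrative rephrasing styles; real pipelines use many more formats.
REPHRASE_STYLES = {
    "textbook": (
        "Rewrite the following passage as a concise textbook section with clear "
        "definitions and one worked example:\n\n{source}"
    ),
    "dialogue": (
        "Rewrite the following passage as a short dialogue between two friends, "
        "one of whom is explaining the topic to the other:\n\n{source}"
    ),
}

def rephrase_document(source_text: str) -> Dict[str, str]:
    """Produce several restructured variants of one high-quality source document."""
    return {
        style: call_llm(template.format(source=source_text))
        for style, template in REPHRASE_STYLES.items()
    }
```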
BeyondWeb (focused on pre-training data) and OpenThoughts (focused on post-training datasets for reasoning) extended these ideas further, analyzing which variables in synthetic data generation matter most.
The first takeaway is that synthetic data generation should aim to improve the information density of tokens. Put simply, the shorter and more informative the content, the better. BeyondWeb reports considerable performance improvements from summaries of content rather than complete passages.
It is worth noting that most papers find naive repetition of content degrades performance, and that pure synthetic generation offers only modest improvements; but when synthetic data is carefully curated with the right strategies, there are clear performance gains. This suggests that the data wall can be scaled with the right synthetic data generation.
Seed quality is also a dominant factor - synthetic data generated from high-quality sources yields considerable gains. Higher-quality data also tends to have better information density, which ties back to our first takeaway. BeyondWeb and OpenThoughts appear to agree on this.
OpenThoughts, however, had a very interesting finding - it was answer diversity, not question diversity, that yielded better results. Instead of collecting more questions, they sampled the best questions and generated multiple answers for each. This approach yielded better reasoning results, which could indicate that once knowledge is ingrained in a model, it performs better by learning multiple pathways to the same question and transferring that learning to other questions.
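
As a sketch of what answer-diversity-first sampling could look like, the snippet below keeps a small set of strong questions and samples several candidate answers for each. The prompt wording, the number of answers per question, and the `call_llm` stub (the same hypothetical backend as in the rephrasing sketch above) are assumptions, not details from the OpenThoughts pipeline.

```python
from typing import Dict, List

def call_llm(prompt: str) -> str:
    """Same hypothetical generation stub as in the rephrasing sketch above."""
    raise NotImplementedError("plug in your own LLM call here")

def build_reasoning_set(questions: List[str], answers_per_question: int = 4) -> List[Dict[str, str]]:
    """For each curated question, sample several answers so the model sees
    multiple reasoning paths to the same target. In practice each call_llm
    invocation would use a non-zero sampling temperature so the answers differ."""
    dataset = []
    for question in questions:
        for i in range(answers_per_question):
            prompt = (
                "Answer the following question, showing your reasoning step by step:\n"
                f"{question}"
            )
            dataset.append({"question": question, "answer": call_llm(prompt), "sample_id": str(i)})
    return dataset
```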
There are still many questions to ask and answer about the ideal way to curate training data - it remains a frontier research problem, and an important one. Scaling the intelligence wall is only possible if we first scale the data wall and find more efficient architectures and datasets that can be trained at lower cost. It might even enable models capable of continual learning on consumer-grade computers.
We’d love to hear your thoughts!
Awesome Machine Learning Summer Schools is a non-profit organisation that keeps you updated on ML Summer Schools and their deadlines. Simple as that.
Have any questions or doubts? Drop us an email! We would be more than happy to talk to you.