Home
Manage Projects
Students
About us
Guide
Available Projects
Finished Projects
Info
Newspaper
Contact
sign in
sign up
Generation of Synthetic Textual Training Data
AI and Machine Learning
Project Guide :
Sasha Apartsin
Development :
Start :
2025-10-29
Finish :
2026-03-08
Hebrew Year :
5786 (תשפ"ו)
Semesters :
1st & 2nd
Description
Alexander (Sasha) Apartsin, http://apartsin.faculty.ac.il/ alexanderap@hit.ac.il

INTRODUCTION
Many practical text and language processing use cases can now be successfully addressed using modern deep neural networks (DNNs). However, despite their success, deep neural networks require large amounts of labelled data, which is often difficult or impossible to obtain in practice. For instance, in healthcare, classifying rare diseases is particularly challenging due to the scarcity of relevant clinical data and strict privacy constraints. In cybersecurity, detecting zero-day attacks is equally problematic, as these threats are new and lack any prior labelled examples for supervised learning. Fortunately, rapid advances in generative large language models (LLMs) now make it possible to produce high-quality synthetic text data quickly, at low cost, and in large quantities.

BACKGROUND
The goal of this project is to build a collection of software packages that automatically generate synthetic datasets using the most recent advances in generative AI. These packages will rely on state-of-the-art language models and modern development libraries to create realistic and diverse text data. Once developed, the software will be used to produce a broad range of synthetic datasets for training and evaluating AI models in several high-impact domains, including education, software engineering, cybersecurity, and healthcare. These datasets will enable research and experimentation in settings where real data is scarce, sensitive, or costly to obtain.

PROJECT SCOPE
The project team will focus on a set of synthetic data generation tasks to achieve both high realism and broad coverage across different domains. Once robust data generation strategies are developed and the datasets are produced, we will use state-of-the-art pretrained models to establish valid baseline results for each associated task. Each task in the project will follow a structured workflow:
1. Designing a data generation strategy tailored to the task and domain.
2. Implementing the corresponding software package.
3. Conducting data generation experiments to produce and validate the datasets.
4. Establishing performance baselines using modern pretrained models.
5. Publishing the software, datasets, and baseline results for community use.

STUDENT REQUIREMENTS
1. Proficiency in Python programming
2. Commitment to at least six weekly hours on average

DEVELOPMENT TOOLS
1. Programming language: Python 3.x
2. LLM programming libraries: OpenAI, LangGraph, Ollama, HuggingFace transformers
3. Development environment: JupyterLab, VSCode, PyCharm

DELIVERABLES
The final Git repository will include the following components:
1. Source Code
   - Modular software packages for synthetic data generation
   - Scripts for running data generation, preprocessing, and baseline evaluations
2. Generated Datasets
   - Synthetic training and evaluation datasets across all target domains
   - Metadata and format descriptions for each dataset
3. Baseline Results
   - Evaluation scripts and configuration files
   - Performance metrics from off-the-shelf models (e.g., GPT, LLaMA)
   - Comparison plots and result summaries
4. Documentation
   - User manual with installation and usage instructions
   - Developer guide for extending or modifying the codebase
   - Task-specific data generation strategies and design rationale
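To make the workflow above concrete, here is a minimal sketch of what one such data generation package could look like. This is an illustration only, not the project's actual code: the function name, the prompt template, the disease labels, and the stub generator are all assumptions, and the stub stands in for a real LLM call (e.g. via the OpenAI or HuggingFace libraries listed under development tools).

```python
# Hypothetical sketch of a synthetic labeled-text generator.
# The LLM backend is abstracted behind a plain callable so it can be
# swapped for any of the libraries named in the tools list.
import json
from typing import Callable, Iterable


def generate_dataset(labels: Iterable[str],
                     generate: Callable[[str], str],
                     per_label: int = 2) -> list[dict]:
    """Produce synthetic labeled examples as {"text", "label"} records."""
    records = []
    for label in labels:
        # Illustrative prompt template; a real strategy would be tailored
        # to the task and domain (workflow step 1).
        prompt = f"Write one short clinical note describing a case of {label}."
        for _ in range(per_label):
            records.append({"text": generate(prompt), "label": label})
    return records


# Stub generator standing in for a real LLM call; it simply echoes the prompt.
stub = lambda prompt: f"[synthetic] {prompt}"

dataset = generate_dataset(["rare-disease-A", "rare-disease-B"], stub)
print(json.dumps(dataset[0]))
```

In a real package, the stub would be replaced by an API or local-model call, and the records would be written out as JSONL with the metadata and format descriptions the deliverables section requires.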
Emphasis in project execution
The project is carried out in cooperation with industry and combines meeting deadlines with creative, focused work on the task.
Status:
Shown in Available Projects
Create New Student Profile + Register to this Project
I have a question