Challenge

In the realm of artificial intelligence, large foundation models such as OpenAI's GPT-3.5 and GPT-4, alongside Anthropic’s models, have set a high standard for multilingual text generation. However, a notable challenge arises within the German market, where industry clients often prefer to build AI solutions on top of open-source Large Language Models (LLMs) that can be hosted and deployed locally and privately. This requirement is primarily driven by the need to process proprietary data securely, without the risk of exposing it to the internet. Additionally, clients seek the autonomy to manage the fine-tuning and alignment of these models to tailor them precisely to their needs.

A significant obstacle is encountered here: the majority of state-of-the-art open-source models exhibit limited capabilities in generating text in the German language. This limitation poses a critical challenge for deploying AI solutions that meet the specific demands of the German-speaking market. The question then becomes: how can we elevate the German language capabilities of smaller open-source language models to match the quality offered by larger foundation models? This challenge not only highlights the gap in current open-source AI offerings but also underscores the necessity for innovative solutions that can bridge this divide, ensuring that German-language AI applications can achieve the same level of excellence and applicability as their counterparts in more widely supported languages.

Solution

In response to the challenge of enhancing German language generation in open-source Large Language Models (LLMs), we embarked on an initiative to comprehensively train several major open-source LLMs. This training utilized a substantial dataset specifically designed for German instruction tuning, aiming to elevate their performance across a multitude of tasks when generating German text.

Recognizing the broader impact and potential of our work, we made a deliberate choice to share both the improved models and the datasets publicly on Hugging Face. This decision was rooted in our firm belief in the transformative power of open-source AI technology and its capacity to benefit humanity at large. By making these resources available to the public, we contribute to the ongoing development and refinement of AI technologies, fostering a community of collaboration and innovation.

The models, optimized through our efforts, are not only adept at high-level German text generation but are also characterized by their relatively small size. This compactness does not come at the expense of performance; rather, these high-performing LLMs represent a significant step towards more accessible and democratized AI. They are designed to require less computational power for both training and inference, making them ideally suited for integration into a variety of applications, including small personal AI assistants. Such assistants, capable of running locally on personal computers, exemplify our vision for AI technology: powerful, efficient, and within reach of individual users, thereby advancing the field towards greater inclusivity and utility.

Link to models

AI Tech Stack

  • AWS SageMaker for hardware access
  • DeepSpeed for distributed training setup
  • vLLM for fast and efficient inference (see the inference sketch after this list)
  • Hugging Face datatrove for data pre-processing tasks
  • MergeKit for merging multiple models to stabilize overall performance
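
To illustrate the serving side of this stack, here is a minimal vLLM inference sketch. The model identifier and sampling parameters are illustrative assumptions, not the exact released checkpoints or settings we used.

```python
# Minimal vLLM inference sketch. The model ID below is a placeholder, and the
# sampling values are illustrative rather than our production configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/german-instruct-7b")  # hypothetical model identifier
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["Erkläre kurz, was ein Sprachmodell ist."], sampling)
for output in outputs:
    print(output.outputs[0].text)
```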

DATASET CREATION

Large-Scale Translation

In our journey to enhance German language capabilities in open-source LLMs, we built on a large-scale translation of cutting-edge instructional datasets. Our methodology adopted a multi-step translation process, beginning with initial translations that were then improved step by step for quality. This iterative process allowed us to refine the translations, ensuring a high level of accuracy and relevance.
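
As a rough illustration of this multi-step idea, the sketch below produces a German draft and then asks the model to revise its own output. The checkpoint, prompts, and two-pass structure are assumptions for illustration, not the exact translation pipeline.

```python
# Illustrative translate-then-refine loop; model choice and prompts are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def translate_and_refine(english_text: str) -> str:
    # Pass 1: initial German draft.
    draft_prompt = f"Translate the following instruction into German:\n\n{english_text}\n\nGerman:"
    draft = generator(draft_prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]

    # Pass 2: ask the model to polish its own draft for fluency and faithfulness.
    refine_prompt = (
        "Improve the following German translation so that it is fluent and faithful "
        f"to the English source.\n\nEnglish: {english_text}\nGerman draft: {draft}\n\nImproved German:"
    )
    refined = generator(refine_prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
    return refined.strip()
```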

Pre-processing

A crucial step in our dataset preparation was semantic deduplication and the removal of potentially harmful data. This step was essential for maintaining the integrity and safety of the training material, ensuring that the resulting AI models would generate responsible and ethical outputs.
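
A minimal sketch of embedding-based semantic deduplication is shown below, assuming a sentence-transformers encoder and a simple cosine-similarity threshold; the encoder choice and threshold are illustrative, and the actual pipeline (for example, via datatrove) may differ.

```python
# Semantic-deduplication sketch: drop samples whose embedding is too similar to
# one already kept. Encoder and threshold are assumptions, not our exact setup.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def semantic_dedup(texts, threshold=0.92):
    embeddings = encoder.encode(texts, normalize_embeddings=True)
    kept_texts, kept_vecs = [], []
    for text, vec in zip(texts, embeddings):
        # With normalized vectors, cosine similarity reduces to a dot product.
        if kept_vecs and np.max(np.asarray(kept_vecs) @ vec) >= threshold:
            continue  # near-duplicate of an already kept sample
        kept_texts.append(text)
        kept_vecs.append(vec)
    return kept_texts
```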

Synthetic Data Generation

To further enrich our dataset, we utilized LLMs to create synthetic data specifically tailored to embody German linguistic characteristics. This bespoke data was crucial for giving the models a nuanced understanding of, and adaptation to, the German language. The compilation of this data, along with the implementation of the latest chat prompt template techniques, constituted a comprehensive and robust dataset for training purposes.
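
As an example of the chat-template step, the snippet below formats a synthetic German instruction/response pair with a tokenizer's built-in chat template; the tokenizer is an illustrative choice, and any chat-template-aware tokenizer works the same way.

```python
# Formatting a synthetic German instruction/response pair with a chat template.
# The tokenizer is an illustrative assumption, not necessarily the one we used.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "Fasse den folgenden Text in zwei Sätzen zusammen: ..."},
    {"role": "assistant", "content": "Hier ist eine kurze Zusammenfassung: ..."},
]

# Render the conversation into the exact string format the model expects.
training_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(training_text)
```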

TRAINING

Our training approach was characterized by an emphasis on hardware efficiency and the prevention of catastrophic forgetting. We employed parameter-efficient fine-tuning methods to achieve this, ensuring that our models remained lean and effective.
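
A minimal sketch along these lines is shown below, assuming a QLoRA-style setup with 4-bit loading via bitsandbytes and LoRA adapters via PEFT; the base model and hyperparameters are illustrative assumptions rather than our exact training configuration.

```python
# Illustrative QLoRA-style parameter-efficient setup; base model, quantization,
# and LoRA hyperparameters are assumptions, not the exact production values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```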

Utilizing the recent LASER-QLoRA technique, we pinpointed the layers within the models that exhibited the highest signal-to-noise ratio for fine-tuning. This strategic approach, developed by an international research group, allowed us to optimize the models' learning process, enhancing their language generation capabilities.
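
A rough sketch of the layer-selection idea: estimate a signal-to-noise ratio per weight matrix from its singular-value spectrum and fine-tune only the highest-scoring layers. The Marchenko-Pastur-style cutoff below is a simplified stand-in for the published procedure, not a faithful reimplementation.

```python
# Simplified signal-to-noise estimate per weight matrix from its singular values.
# The random-matrix cutoff is a rough stand-in for the full LASER-style procedure.
import torch

def weight_snr(weight: torch.Tensor) -> float:
    w = weight.float()
    singular_values = torch.linalg.svdvals(w)
    m, n = w.shape
    # Marchenko-Pastur upper edge for a random matrix with matching variance.
    cutoff = w.std() * (m ** 0.5 + n ** 0.5)
    signal = singular_values[singular_values > cutoff]
    noise = singular_values[singular_values <= cutoff]
    if noise.numel() == 0:
        return float("inf")
    return (signal.pow(2).sum() / noise.pow(2).sum()).item()

# Rank 2-D weight matrices by SNR and pick the top layers as fine-tuning targets:
# snr_by_layer = {name: weight_snr(p) for name, p in model.named_parameters() if p.ndim == 2}
```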

Finally, upon completing the training of multiple adapters tailored to a diverse portfolio of tasks, we employed MergeKit. This tool enabled us to amalgamate our models with others, a process that significantly stabilized and improved overall performance. Through this innovative approach, we succeeded in crafting models that not only excel in German language generation but also set a new standard for efficiency and adaptability in the AI domain.
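
For illustration, a linear merge of two hypothetical checkpoints could be configured and run through the mergekit-yaml CLI as sketched below; the model paths, weights, and merge method are placeholder assumptions rather than the recipe we shipped.

```python
# Illustrative MergeKit linear merge via the mergekit-yaml CLI.
# Paths, weights, and the merge method are placeholder assumptions.
import subprocess
from pathlib import Path

config = """\
merge_method: linear
dtype: float16
models:
  - model: ./german-instruct-checkpoint   # hypothetical local checkpoint
    parameters:
      weight: 0.5
  - model: ./german-reasoning-checkpoint  # hypothetical local checkpoint
    parameters:
      weight: 0.5
"""

Path("merge_config.yml").write_text(config)
subprocess.run(["mergekit-yaml", "merge_config.yml", "./merged-model"], check=True)
```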