German language capabilities for state-of-the-art open-source models
The Challenge

In the realm of artificial intelligence, large foundation models such as OpenAI's GPT-3.5 and GPT-4, alongside Anthropic's models, have set a high standard for multilingual text generation. However, a notable challenge arises in the German market, where industry clients often prefer to build AI solutions on open-source Large Language Models (LLMs) that can be hosted and deployed locally and privately. This requirement is primarily driven by the need to process proprietary data securely, without the risk of exposing it to the internet. Additionally, clients seek the autonomy to manage the fine-tuning and alignment of these models to tailor them precisely to their needs.

The Solution

To address the challenge of enhancing German language generation in open-source Large Language Models (LLMs), we embarked on an initiative to comprehensively train several major open-source LLMs. This training utilized a substantial dataset specifically designed for German instruction tuning, aiming to elevate their performance across a wide range of German text generation tasks.

Recognizing the broader impact and potential of our work, we made a deliberate choice to share both the improved models and the datasets publicly on Hugging Face. This decision was rooted in our firm belief in the transformative power of open-source AI technology and its capacity to benefit humanity at large. By making these resources publicly available, we contribute to the ongoing development and refinement of AI technologies and foster a community of collaboration and innovation.
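
As a rough illustration of how such publicly released resources can be consumed, the minimal sketch below loads a German instruction-tuning dataset from the Hugging Face Hub with the datasets library. The repository ID and column names are placeholders, not the actual identifiers of the published datasets.

```python
from datasets import load_dataset

# Hypothetical repository ID -- substitute the dataset actually
# published on the Hugging Face Hub.
dataset = load_dataset("example-org/german-instruct", split="train")

# A typical instruction-tuning record pairs a German instruction with
# the desired model response; these column names are assumptions too.
for record in dataset.select(range(3)):
    print(record["instruction"])
    print(record["response"])
```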

A step towards more accessible and democratized AI

The models optimized through our efforts are not only adept at generating high-quality German text but are also relatively small. This compactness does not come at the expense of performance; rather, these high-performing LLMs represent a significant step towards more accessible and democratized AI. They require less computational power for both training and inference, making them well suited for integration into a variety of applications, including small personal AI assistants. Such assistants, capable of running locally on personal computers, exemplify our vision for AI technology: powerful, efficient, and within reach of individual users, thereby advancing the field towards greater inclusivity and utility.
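
To make the local-assistant scenario concrete, here is a minimal sketch of running a compact instruction-tuned model on a personal machine with the Hugging Face transformers library (device_map="auto" additionally requires the accelerate package). The model ID is a placeholder, not one of our published checkpoints.

```python
from transformers import pipeline

# Hypothetical model ID -- replace with the published German model.
generator = pipeline(
    "text-generation",
    model="example-org/german-llm-7b-instruct",
    device_map="auto",  # uses a local GPU if present, otherwise CPU
)

# "Explain in two sentences what a language model is."
prompt = "Erkläre in zwei Sätzen, was ein Sprachmodell ist."
result = generator(prompt, max_new_tokens=100, do_sample=False)

# The pipeline returns the prompt plus the generated continuation.
print(result[0]["generated_text"])
```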

AI Tech Stack

• AWS SageMaker for hardware access

• DeepSpeed for the distributed training setup

• vLLM for fast and efficient inference (see the sketch after this list)

• Hugging Face datatrove for data pre-processing tasks

• mergekit for merging multiple models to stabilize overall performance
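
To give a flavor of the inference side of this stack, the following sketch shows batched generation with vLLM's Python API. The model ID is again a placeholder for the fine-tuned checkpoint.

```python
from vllm import LLM, SamplingParams

# Hypothetical model ID -- point this at the fine-tuned checkpoint.
llm = LLM(model="example-org/german-llm-7b-instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

# "Summarize the key points of the meeting." /
# "Write a polite reply to a customer inquiry."
prompts = [
    "Fasse die wichtigsten Punkte des Meetings zusammen.",
    "Schreibe eine höfliche Antwort auf eine Kundenanfrage.",
]

# vLLM batches the prompts and schedules them with paged attention,
# which is what makes it fast for many concurrent requests.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

Compared with a plain transformers loop, vLLM's continuous batching makes it the better fit when a model serves many users at once, which is why it sits in the stack alongside the training tooling.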