Open-Source Generative AI
Abacus.AI is committed to open-source AGI and has significantly contributed to open-source AI and LLMs. Our research is open-sourced, reviewed, and published in top AI and ML conferences.
In addition, our open-source contributions to LLMs have led to several other open-source labs adopting some of our techniques and pushing the boundaries of enterprise and SOTA AI.
Here are the key contributions from Abacus.AI to open-source:
 
Our Models
Abacus.AI Dracarys Line
We have two Dracarys fine-tunes: one based on Llama-3.1 70B and another based on Qwen 72B.
These fine-tunes enhance the coding and reasoning abilities of the base LLM. Dracarys is built on a refined formula called the "Dracarys recipe," which applies optimized fine-tuning techniques to large-scale models such as Qwen-2 72B, Qwen2.5-72B, and Llama-3.1 70B.
Our research and experiments on dataset construction and fine-tuning methodology significantly boost the coding and reasoning capabilities of these open-source LLMs.
According to recent LiveCodeBench benchmarks, Dracarys shows substantial improvement in code execution and test output prediction scores. Similarly, on livebench.ai the Llama-based Dracarys model scores 36.31 in coding, up from 32.67 for the base model, and 47.33 in reasoning, up from 40.67 for the base model.
Model | Reasoning Average | Coding Average
claude-3-5-sonnet-20241022 | 58.67 | 67.13
claude-3-5-sonnet-20240620 | 58.67 | 60.85
qwen2.5-coder-32b-instruct | 47.33 | 56.85
dracarys2-72b-instruct | 42.67 | 56.64
qwen2.5-72b-instruct | 46.00 | 56.56
gemini-exp-1114 | 54.67 | 52.36
gpt-4o-2024-08-06 | 54.67 | 51.44
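To put the Llama-based livebench.ai scores quoted above in perspective, the short sketch below computes the absolute and relative gains over the base model. The four scores come from this page; the helper name gain is only illustrative.

```python
def gain(base: float, tuned: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = tuned - base
    relative = 100.0 * absolute / base
    return round(absolute, 2), round(relative, 2)


# (base model score, Dracarys score) as quoted for the Llama-based model
scores = {"coding": (32.67, 36.31), "reasoning": (40.67, 47.33)}

for task, (base, tuned) in scores.items():
    absolute, relative = gain(base, tuned)
    print(f"{task}: +{absolute:.2f} points ({relative:.2f}% relative)")
```

This works out to a gain of 3.64 points in coding (about 11% relative) and 6.66 points in reasoning (about 16% relative).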
Abacus.AI Smaug Line
Smaug is our most recent open-source line of models, and has set a new standard for open-source, topping the Hugging Face Open LLM Leaderboard with an average score of 80.48%, nearly 2% better than the next best model. We have introduced several Smaug fine-tunes, with the flagship model being Smaug-72B. This model is a fine-tuned version of Qwen-72B, enhanced through our novel Direct Preference Optimization-Positive (DPOP) method.
Traditional DPO optimizes only the relative likelihood of preferred over dispreferred completions, so it can lower the likelihood of the preferred completion itself as long as the dispreferred one falls faster. DPOP introduces a new term in the loss function that penalizes any reduction in the likelihood of the preferred (positive) completions. This innovation addresses a critical shortcoming in LLM fine-tuning and significantly improves model reliability and effectiveness.
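The idea can be sketched in a few lines for a single preference pair. This is a minimal illustration, not our training code: it assumes the summed log-probabilities of each completion under the policy and the reference model are already available, and the default values of beta and lam are placeholders chosen for the example. Setting lam to 0 recovers the standard DPO loss.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpop_loss(
    policy_logp_pos: float,
    policy_logp_neg: float,
    ref_logp_pos: float,
    ref_logp_neg: float,
    beta: float = 0.3,   # illustrative value, an assumption
    lam: float = 50.0,   # illustrative value, an assumption
) -> float:
    """DPO-style loss with an added penalty on the positive completion."""
    # Log-ratios of policy to reference for each completion
    pos_ratio = policy_logp_pos - ref_logp_pos
    neg_ratio = policy_logp_neg - ref_logp_neg
    # Penalty is active only when the policy assigns the positive
    # completion a lower likelihood than the reference model does
    penalty = max(0.0, ref_logp_pos - policy_logp_pos)
    return -log_sigmoid(beta * (pos_ratio - neg_ratio - lam * penalty))


# The positive completion lost likelihood (-10 -> -12), but the negative
# lost more (-15 -> -20): plain DPO is satisfied, DPOP is not.
print(dpop_loss(-12.0, -20.0, -10.0, -15.0, lam=0.0))
print(dpop_loss(-12.0, -20.0, -10.0, -15.0, lam=5.0))
```

In the example call, the loss with lam set to 0 is small because the margin between the two completions grew, while the loss with a positive lam is much larger because the positive completion became less likely than under the reference model.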
We also applied these techniques to make Smaug-34B and Smaug-Mixtral, both of which are leaders in performance in their classes.
Smaug-72B - The World’s Best Open-Source LLM!
Benchmark | GPT-3.5 (PROP) | GEMINI PRO (PROP) | MISTRAL-SMALL (PROP) | MISTRAL-MEDIUM (PROP) | SMAUG-72B (OPEN-SOURCE)
MMLU | 70.0 | 71.8 | 70.6 | 75.3 | 77.15
HellaSwag | 85.5 | 84.7 | 86.7 | 88.9 | 89.27
ARC | 85.2 | unknown | 85.8 | 88.9 | 76.02
Winogrande | 81.6 | unknown | 81.2 | 88 | 85.05
GSM8K | 57.1 | unknown | 58.4 | 66.7 | 78.7
TruthfulQA | unknown | unknown | unknown | unknown | 76.67
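As a quick consistency check, the 80.48% leaderboard figure matches the unweighted mean of the six Smaug-72B scores in the table. The sketch below assumes the leaderboard score is a simple average of these six benchmarks; the dictionary name is only illustrative.

```python
from statistics import mean

# Smaug-72B scores as listed in the table above
smaug_72b = {
    "MMLU": 77.15,
    "HellaSwag": 89.27,
    "ARC": 76.02,
    "Winogrande": 85.05,
    "GSM8K": 78.7,
    "TruthfulQA": 76.67,
}

average = round(mean(smaug_72b.values()), 2)
print(f"Average over {len(smaug_72b)} benchmarks: {average}")
```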
 
Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
 
Abacus.AI Giraffe
Open-source LLMs have continued to proliferate within the AI landscape and have shown comparable performance to the closed-source LLMs of OpenAI, Google, and others. However, open-source LLMs often come with a limited context length, which restricts their utility for creating custom LLMs on your knowledge base. Since the context length caps the amount of proprietary data you can send in a single API call, a bigger context length is crucial for various tasks.
Many methods have been proposed for context-length extrapolation; in our extensive research, we tested each approach thoroughly and proposed two new approaches. One of these approaches is truncation, which has shown promising results.
Along with this research, we released Giraffe. Based on Llama-2, this model is the world’s first open-source LLM capable of handling a 32k context. This capability is vital for various applications, from complex information retrieval to sustained conversational AI and code generation on an existing use case. On an enterprise level, it can function as an AI brain for your business, boosting productivity, improving decision-making, and providing key insights.
 
Arka Pal, Deep Karkhanis, Manley Roberts, Samuel Dooley, Arvind Sundararajan, Siddartha Naidu
arXiv Preprint
Abacus.AI Professor
TheProfessor showcases the innovative potential of merging different LLMs, leveraging their unique strengths to create composite models that excel across various domains. Developed with mergekit using pre-trained language models, TheProfessor provides broad conversational, reasoning, scientific, medical, and mathematical skills.
It is also helpful in concept development, from conception to implementation, including code and writing/reviewing/revising papers with citations.
Abacus.AI Liberated-Qwen
Open-source LLMs are notorious for not following system prompts, which makes them less suited for real-world usage, including being more vulnerable to end user misuse. To fix this critical problem, we introduce Liberated-Qwen1.5-72B, the most performant uncensored model in the world.
Liberated was trained using open-source datasets, including SystemChat, a new dataset we created. (You can read more about this dataset below.) Liberated-Qwen performs the best out of the open-source models on the HumanEval leaderboard. It has an MMLU score of 77+, the highest among open-source models.
While this model is entirely uncensored and liberated, it demonstrates strong adherence to system prompt following, and thus allows you to set bounds on its behavior with an appropriate system prompt.
Our Benchmarks
LiveBench AI
With LLMs training on web-scale data, test-set contamination is a pervasive concern in LLM evaluation that can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and they are not reliable for hard questions. For example, LLM judges make mistakes up to 40% of the time on challenging math and reasoning tasks.
At Abacus.AI, we collaborated with Yann LeCun to create LiveBench AI, the world’s first future-proof LLM Benchmark. LiveBench contains frequently updated questions from recent information sources, and it relies on objective, ground-truth scoring which can’t be gamed. LiveBench contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. The LiveBench leaderboard ranks more than 50 LLMs, including proprietary models and open-source models ranging from 0.5B to 110B in size. LiveBench is challenging, with the best models achieving less than 65% overall accuracy. We will release new, harder tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future.
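To illustrate what objective, ground-truth scoring looks like in contrast to judge-based evaluation, here is a minimal sketch. It is not the LiveBench implementation: the record fields, the exact-match rule, and the choice to average category scores into an overall score are all assumptions made for the example.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(answer.strip().lower().split())


def score_by_category(records: list[dict]) -> tuple[dict[str, float], float]:
    """Return per-category accuracy (percent) and the mean across categories."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, seen]
    for record in records:
        hit = normalize(record["model_answer"]) == normalize(record["ground_truth"])
        totals[record["category"]][0] += int(hit)
        totals[record["category"]][1] += 1
    per_category = {c: 100.0 * hits / seen for c, (hits, seen) in totals.items()}
    overall = sum(per_category.values()) / len(per_category)
    return per_category, overall


# Hypothetical records, for illustration only
records = [
    {"category": "math", "model_answer": " 42 ", "ground_truth": "42"},
    {"category": "math", "model_answer": "17", "ground_truth": "19"},
    {"category": "coding", "model_answer": "True", "ground_truth": "true"},
]
per_category, overall = score_by_category(records)
print(per_category, overall)
```

Because every answer is compared against a known ground truth, the score does not depend on the opinion of a human or LLM judge.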
 
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum
Our Datasets
DPO
We used three datasets to create our Smaug series of models. These datasets are intended for fine-tuning LLMs with the DPOP loss function. Arc_DPO_FewShot targets understanding of science at the grade-school level. HellaSwag_DPO_FewShot contains common-sense inference questions that LLMs commonly struggle with. MetaMath_DPO_FewShot targets math and reasoning skills and aligns models toward precise calculation.
SystemChat
SystemChat is a dataset of 7,000 synthetic conversations generated with Mistral-Medium and Dolphin-2.7-mixtral-8x7b. It was designed to teach models to comply with the system prompt over long multi-turn conversations, even with unusual or mechanical system prompts. Fine-tuning on this dataset makes a model far more usable and harder to jailbreak. No guardrails or censorship are added to the dataset, and you can implement your own alignment layer.
WikiQA
The WikiQA task is to answer a question based on the information given in a Wikipedia document. We selected large Wikipedia documents and truncated them to produce multiple versions of each document, with sizes varying from 2,000 to 16,000 tokens. Each document size also has multiple versions that place the question and answer text at different locations.
However, with a Wikipedia-based dataset, a model could answer correctly from its pretraining corpus rather than from the context. To combat this, we created an "altered" dataset that consists only of questions with numerical answers, where the answer and every occurrence of it in the document are changed to a different number. This ensures that if an LLM recollects from its pretraining corpus, it gives a wrong answer.
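The two ideas above, altering numeric answers and varying where the answer appears, can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used to build the dataset: the function names are hypothetical, and sentences stand in for token-level positions.

```python
import re


def alter_numeric_answer(document: str, answer: str, new_answer: str) -> str:
    """Replace every standalone occurrence of a numeric answer in the document."""
    # Lookarounds keep "1932" from matching inside a longer number like "119320"
    pattern = re.compile(rf"(?<!\d){re.escape(answer)}(?!\d)")
    return pattern.sub(lambda _: new_answer, document)


def place_answer(filler: list[str], answer_sentence: str, position: float) -> str:
    """Insert the answer sentence at a relative position (0.0 = start, 1.0 = end)."""
    index = round(position * len(filler))
    return " ".join(filler[:index] + [answer_sentence] + filler[index:])


# Hypothetical document, for illustration only
document = "The bridge opened in 1932. In 1932 it carried 11000 cars."
altered = alter_numeric_answer(document, "1932", "1947")
print(altered)

filler = ["A.", "B.", "C.", "D."]
print(place_answer(filler, "The answer is 1947.", 0.5))
```

A model that answers "1932" from memory is now scored as wrong, since the only correct answer in the altered context is "1947".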
Other Datasets
We also created MetaMathFewShot, a new few-shot version of the popular MetaMath dataset. This allows the model to understand the concept of few-shot prompting. Our LongChat-Lines was used to evaluate the performance of a model fine-tuned to operate on longer contexts.
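For readers unfamiliar with few-shot prompting, the sketch below shows the general shape of such a prompt: a handful of solved examples followed by the new question. The "Question:/Answer:" template and the sample problems are assumptions for illustration and do not reflect the actual format of MetaMathFewShot.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Concatenate solved examples, then the unanswered target question."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical examples, for illustration only
examples = [
    ("What is 12 + 30?", "42"),
    ("What is 9 * 7?", "63"),
]
prompt = build_few_shot_prompt(examples, "What is 15 * 4?")
print(prompt)
```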
Copyright © 2025 Abacus.AI. All Rights Reserved