Warsaw.AI News 24-30.06.2024
Hello AI Enthusiasts!
We invite you to check out the AI news we found for you in the week of 24-30.06.2024:
“Researchers upend AI status quo by eliminating matrix multiplication in LLMs” - Researchers have devised a method to enhance the efficiency of AI models by eliminating the need for matrix multiplication, a core redesign of neural network operations typically accelerated by GPUs.
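The core trick in such MatMul-free designs is constraining weights to {-1, 0, +1}, so a matrix-vector product collapses into additions and subtractions. A minimal NumPy sketch of the idea (illustrative only; `quantize_ternary` and its threshold are my assumptions, not the paper's exact scheme):

```python
import numpy as np

def quantize_ternary(W, threshold=0.5):
    """Map a dense weight matrix to {-1, 0, +1} plus a scale (sketch)."""
    scale = np.abs(W).mean()
    Wq = np.where(W > threshold * scale, 1,
                  np.where(W < -threshold * scale, -1, 0))
    return Wq, scale

def ternary_matvec(Wq, x):
    """With ternary weights, each output element is just a sum/difference
    of selected inputs -- no multiplications required."""
    out = np.empty(Wq.shape[0], dtype=x.dtype)
    for i, row in enumerate(Wq):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out
```

For weights that are already ternary, `ternary_matvec(Wq, x)` matches `Wq @ x` exactly while using only additions.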
“How to think about creating a dataset for LLM finetuning evaluation” - A summary of evaluations required for a task involving structured data generation.
“Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control” - Bigger, Regularized, Optimistic (BRO) is a novel state-of-the-art model-free reinforcement learning agent for continuous control tasks. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO was benchmarked on 40 tasks from the DeepMind Control Suite, MetaWorld, and MyoSuite, outperforming the current state-of-the-art model-based agent TD-MPC2 and other strong model-free baselines.
“EvalAlign: Evaluating Text-to-Image Models through Precision Alignment of Multimodal Large Models with Supervised Fine-Tuning to Human Annotations” - The paper introduces EvalAlign, a precise and stable evaluation metric for text-to-image generative models, addressing the lack of fine-grained metrics in the field. EvalAlign uses supervised fine-tuning of Multimodal Large Language Models and detailed evaluation protocols to closely align with human judgments, demonstrating superior metric stability and alignment with human preferences compared to existing metrics.
“FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models” - The paper introduces FreeTraj, a tuning-free framework for trajectory-controllable video generation using diffusion models, eliminating the need for additional training. By guiding noise construction and attention computation, FreeTraj allows users to manually input or use automatically generated trajectories, demonstrating enhanced control over video trajectories through extensive experiments.
“GeoMFormer: A General Architecture for Geometric Molecular Representation Learning” - The paper introduces GeoMFormer, a novel Transformer-based molecular model designed to learn invariant and equivariant features for accurate molecular property prediction and behavior simulation. By developing two separate streams for these representations and using cross-attention modules for information fusion, GeoMFormer demonstrates strong performance across various tasks and scales, showcasing its flexibility and effectiveness in molecular modeling.
“Retrieval Augmented Instruction Tuning for Open NER with Large Language Models” - The paper explores Retrieval Augmented Instruction Tuning (RA-IT) for improving open named entity recognition (NER) by using semantically similar examples as context in large language models (LLMs). RA-IT's effectiveness is demonstrated through experiments in both English and Chinese scenarios, showing significant improvements in information extraction tasks and exploring various retrieval strategies within the framework.
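The retrieval step in RA-IT boils down to embedding the input, pulling the most similar annotated examples, and prepending them as in-context demonstrations. A minimal sketch (cosine-similarity retrieval; the prompt wording is a hypothetical illustration, not the paper's template):

```python
import numpy as np

def retrieve_examples(query_emb, example_embs, k=3):
    # cosine similarity between the query embedding and each candidate
    sims = example_embs @ query_emb / (
        np.linalg.norm(example_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return np.argsort(-sims)[:k]

def build_ner_prompt(text, examples, idxs):
    # prepend the retrieved examples as in-context demonstrations
    context = "\n".join(f"Example: {examples[i]}" for i in idxs)
    return f"{context}\nExtract the named entities from: {text}"
```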
“Data curation via joint example selection further accelerates multimodal learning” - The paper introduces multimodal contrastive learning with joint example selection (JEST), which improves large-scale pretraining by selecting batches of data jointly rather than independently. JEST significantly accelerates training and reduces computational overhead by using multimodal contrastive objectives to measure batch learnability and leveraging model approximation techniques, resulting in superior performance with fewer iterations and less computation compared to state-of-the-art models.
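At its heart, JEST scores candidate data by "learnability": examples the current learner still finds hard but a reference model finds easy. A toy sketch of that selection criterion (per-example losses stand in for the paper's batch-level contrastive losses, and real JEST scores whole sub-batches jointly):

```python
import numpy as np

def learnability(learner_losses, reference_losses):
    # high score = still hard for the learner, but easy for the
    # pretrained reference model -- i.e. worth training on now
    return np.asarray(learner_losses) - np.asarray(reference_losses)

def select_batch(learner_losses, reference_losses, n_select):
    scores = learnability(learner_losses, reference_losses)
    return np.argsort(-scores)[:n_select]
```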
“Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text” - Director3D introduces a novel text-to-3D generation framework utilizing real-world datasets to create realistic 3D scenes and adaptive camera trajectories. By employing a Trajectory Diffusion Transformer for camera modeling, a Gaussian-driven Multi-view Latent Diffusion Model for scene representation, and refining with SDS++ loss, Director3D significantly outperforms existing methods in generating high-quality 3D scenes from textual descriptions.
“Meta Large Language Model Compiler: Foundation Models of Compiler Optimization” - The paper introduces Meta LLM Compiler, a set of pre-trained models designed for code optimization tasks, enhancing understanding of compiler intermediate representations, assembly language, and optimization techniques. Trained on 546 billion tokens of LLVM-IR and assembly code, the models demonstrate significant potential in optimizing code size and converting assembly back to LLVM-IR, achieving notable results in disassembly and optimization efficiency, providing a cost-effective foundation for further research in compiler optimization.
“ALPBench: A Benchmark for Active Learning Pipelines on Tabular Data” - The paper introduces ALPBench, a standardized benchmark for evaluating active learning pipelines, facilitating reproducible performance comparisons across different query strategies and learning algorithms. ALPBench includes 86 real-world tabular classification datasets and 5 active learning settings, supporting comprehensive studies such as evaluating 9 query strategies with 8 learning algorithms, addressing the community's need for systematic benchmarking in active learning research.
“Point-SAM: Promptable 3D Segmentation Model for Point Clouds” - The Segment Anything Model (SAM) has significantly advanced 2D foundation models for image segmentation, but extending this success to 3D models faces challenges like data format inconsistencies, model efficiency, and a lack of diverse labeled data. Addressing these challenges, the authors propose Point-SAM, a transformer-based 3D promptable segmentation model for point clouds. Leveraging part-level and object-level annotations, their approach incorporates a data engine to generate pseudo labels from SAM, effectively transferring 2D knowledge to enhance performance in 3D applications, surpassing current benchmarks in both indoor and outdoor settings and demonstrating its versatility in tasks like 3D annotation.
“MatText: Do Language Models Need More than Text & Scale for Materials Modeling?” - LLMs have shown significant success across various domains, yet their application in materials science remains relatively unexplored, hindered by challenges such as the effective representation of materials as text and the absence of comprehensive benchmarks. To address this, MatText introduces nine text-based representations for materials, each designed with specific biases to integrate physical knowledge, alongside tools for rigorous evaluation. Despite advancements, current language models struggle to consistently capture geometric details crucial for materials modeling, highlighting the need for improved text-based methods in this domain as emphasized by MatText's findings.
“The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale” - In this study, the creation of high-performance large language models (LLMs) such as Llama 3 and Mixtral hinges on the quality and scale of their pretraining datasets, yet these datasets are not publicly available. To address this gap, FineWeb, a 15-trillion token dataset derived from 96 snapshots of Common Crawl, is introduced, demonstrating superior LLM performance. The study meticulously examines and documents the design choices behind FineWeb, including deduplication and filtering strategies, and releases FineWeb-Edu, a subset optimized for educational content, showing significant improvements on knowledge and reasoning benchmarks like MMLU and ARC.
“Symbolic Learning Enables Self-Evolving Agents” - Current research into AGI focuses on developing language agents—sophisticated pipelines of LLMs that utilize prompting techniques and tool usage methods. However, a significant limitation is their dependence on manual engineering efforts rather than autonomous learning from data. This study introduces agent symbolic learning, a framework enabling language agents to autonomously optimize themselves in a data-centric manner using symbolic optimizers, mimicking back-propagation and gradient descent with natural language simulacrums. Proof-of-concept experiments demonstrate that this approach facilitates the evolution of "self-evolving agents" capable of updating themselves post-deployment, representing a step towards achieving AGI.
“APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets” - This paper introduces APIGen, an automated pipeline for synthesizing reliable datasets essential for function-calling agent models. Using APIGen, the study collects and verifies 3,673 executable APIs across 21 categories, demonstrating that models trained on these datasets achieve state-of-the-art performance on benchmarks, surpassing much larger models such as GPT-4 and GPT-3.5-Turbo. The dataset, containing 60,000 high-quality entries, is released to advance research in function-calling agent domains.
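APIGen's reliability comes from layered verification of every generated sample before it enters the dataset. A hedged sketch of that idea (the three stages mirror the paper's format, execution, and semantic checks, but the function names and JSON shape here are my assumptions):

```python
import json

def verify_sample(call_str, api_registry, semantic_check):
    """Keep a generated function call only if it parses, executes,
    and passes a semantic check (sketch of APIGen-style filtering)."""
    # Stage 1: format check -- the call must be well-formed JSON
    # naming a known API
    try:
        call = json.loads(call_str)
        func = api_registry[call["name"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    # Stage 2: execution check -- the API must run without errors
    try:
        result = func(**call.get("arguments", {}))
    except Exception:
        return False
    # Stage 3: semantic check -- does the result match the query intent?
    return semantic_check(result)
```

Samples failing any stage are discarded, which is how the pipeline keeps the released 60,000 entries high-quality.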
“OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?” - This report evaluates the intelligence of the latest AI models—Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o—using an Olympic medal table approach across diverse disciplines in the OlympicArena benchmark. Claude-3.5-Sonnet demonstrates strong overall performance, occasionally surpassing GPT-4o in subjects like Physics, Chemistry, and Biology, while Gemini-1.5-Pro and GPT-4V follow closely but show a notable performance gap. The study highlights that open-source AI models currently lag behind proprietary ones in achieving superintelligent capabilities, indicating ongoing challenges in AI advancement.
“MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data” - The study introduces MUMU, a model trained to generate images from prompts containing interleaved text and images, demonstrating its ability to compose inputs from different images into coherent outputs like transferring styles or maintaining character consistency across prompts. Despite being trained on image crops from the same dataset, MUMU achieves generalization across tasks, showcasing the potential of multimodal models as versatile tools for image generation and manipulation.
“Suri: Multi-constraint Instruction Following for Long-form Text Generation” - This study addresses the limitation of existing instruction-following research by focusing on generating long-form text under multi-constraint instructions. Introducing the Suri dataset, which pairs human-written long-form texts with LLM-generated complex instructions, the research proposes Instructional ORPO (I-ORPO) for alignment, adapting ORPO's principles by using negative feedback from synthetically corrupted instructions. Evaluations on Mistral-7b-Instruct-v0.2 demonstrate that models fine-tuned with Suri using I-ORPO can produce longer texts (~5K tokens) while maintaining coherence and preferred quality in incorporating complex constraints, compared to models trained with standard supervised fine-tuning.
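ORPO's alignment signal is an odds-ratio penalty between a preferred and a dispreferred continuation; I-ORPO obtains the dispreferred side from the same text paired with a synthetically corrupted instruction. A numeric sketch of the odds-ratio term (sequence-level probabilities stand in for the per-token likelihoods used in practice):

```python
import math

def odds(p):
    # odds of a probability: p / (1 - p)
    return p / (1.0 - p)

def orpo_penalty(p_chosen, p_rejected):
    # -log sigmoid(log odds-ratio); small when the model already assigns
    # higher likelihood to the response under the intact instruction
    log_or = math.log(odds(p_chosen) / odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))
```

The penalty shrinks as the model learns to separate the intact instruction from the corrupted one, which is the negative-feedback signal Suri's I-ORPO training exploits.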
“DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability” - This study introduces DEX-TTS, an advanced acoustic model for expressive Text-to-Speech (TTS) that enhances style representation and generalization capabilities using a diffusion-based framework. DEX-TTS integrates specialized encoders and adapters to effectively handle time-invariant and time-variant styles extracted from reference speech, achieving superior performance in both objective and subjective evaluations across English multi-speaker and emotional datasets without pre-training, demonstrating its effectiveness compared to standard TTS models on single-speaker datasets.
Florence-2 - a foundation vision-language model released by Microsoft. It is small (0.2B and 0.7B parameters) yet delivers strong performance on a variety of computer vision and vision-language tasks.
Gemma-2 - Available in both 9 billion and 27 billion parameter sizes, Gemma 2 is higher-performing and more efficient at inference than the first generation, with significant safety advancements built in.
MARS5 - a voice cloning model with support for 140 languages.
Cambrian-1 - a family of vision-centric multimodal LLMs.
CriticGPT - a model based on GPT-4 that writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF.
https://github.com/openpsi-project/realhf - Distributed system designed for efficient RLHF training with LLMs. RealHF dynamically redistributes LLM parameters across the cluster and minimizes redundant communication while maximizing GPU utilization.
https://github.com/karpathy/LLM101n - Another great course (currently under development) by Andrej Karpathy on building LLMs from scratch. Topics will include transformers, fine-tuning, RL, and many more.
https://github.com/nus-apr/auto-code-rover - Fully automated approach for resolving GitHub issues (bug fixing and feature addition) where LLMs are combined with analysis and debugging capabilities to prioritize patch locations ultimately leading to a patch.
https://github.com/nlkitai/nlux - React and JavaScript open-source library for building conversational AI interfaces. It makes it super simple to build web applications powered by LLMs and AI.
https://github.com/mezbaul-h/june - Local voice chatbot that combines the power of Ollama (for language model capabilities), Hugging Face Transformers (for speech recognition), and the Coqui TTS Toolkit (for text-to-speech synthesis).
https://github.com/ItzCrazyKns/Perplexica - Open-source alternative to Perplexity.ai, an AI-powered search engine. Can use local LLMs via Ollama.
https://github.com/apple/ml-4m - 4M: Massively Multimodal Masked Modeling. A framework for training any-to-any multimodal foundation models. Scalable. Open-sourced. Across tens of modalities and tasks.
https://github.com/BasedHardware/OpenGlass - Open Source Smart Glasses, advertised as “Turn any glasses into AI-powered smart glasses”.
RES-Q is a codebase editing benchmark consisting of 100 hand-crafted, compact natural language edit instructions. The task is to, given an edit instruction and a codebase, make an edit to the codebase that satisfies the instruction.
The Prompt Engineering Tool - a web-based application designed to help users experiment with and optimize prompts for various LLMs.
R2R is a production-ready RAG engine with a RESTful API. R2R includes hybrid search, knowledge graphs, and more.
https://github.com/run-llama/llama-agents - an async-first framework for building, iterating on, and productionizing multi-agent systems, including multi-agent communication, distributed tool execution, human-in-the-loop workflows, and more!
https://github.com/SkalskiP/top-cvpr-2024-papers - This repository is a curated collection of some interesting CVPR 2024 papers.
Building a personalized code assistant with open-source LLMs using RAG Fine-tuning.
FFplus open call (for startups and SMEs) for the development of generative AI models. And here is the call for business experiments.
An article about Investing in the Age of Generative AI