Warsaw.AI News 29.04-5.05.2024
Hello AI Enthusiasts!
Please check out the AI news we found for you in the week of 29.04-5.05.2024:
Vibe-Eval: A New Evaluation Suite for Multimodal Language Models
Vibe-Eval is a challenging and open evaluation suite designed to measure advancements in multimodal language models, focusing on complex image and text interpretation tasks.
Gemma – A Walk-Through of a Modern Transformer Model - A detailed walkthrough of the Gemma transformer model, complete with PyTorch code examples illustrating the implementation of the various language processing stages.
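For a flavor of the code the walkthrough presents, here is a minimal PyTorch sketch of RMSNorm, one of the building blocks such a walk-through covers; the (1 + weight) scaling follows Gemma's public reference implementation, while the rest is a generic rendering rather than the article's exact code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm, one of the stages the walkthrough covers."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-init, applied as (1 + w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS of the features, then apply the learned scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * (1.0 + self.weight)

x = torch.randn(2, 16, 256)       # (batch, seq, hidden)
print(RMSNorm(256)(x).shape)      # torch.Size([2, 16, 256])
```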
KAN: Kolmogorov-Arnold Networks - Inspired by the Kolmogorov-Arnold representation theorem, Kolmogorov-Arnold Networks (KANs) place learnable activation functions on edges instead of fixed activation functions on nodes, improving accuracy and interpretability over Multi-Layer Perceptrons (MLPs). KANs reach higher accuracy with smaller models and enjoy faster neural scaling laws than MLPs, while their intuitive visualization and easy interaction with human users make them promising collaborators for scientists discovering mathematical and physical laws.
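For intuition, here is a toy sketch of the edge-activation idea: each edge carries its own learnable univariate function, and nodes merely sum their inputs. A small Fourier basis stands in for the paper's B-splines, so this illustrates the concept rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    """Toy KAN layer: a learnable 1-D function on every edge; nodes just sum."""
    def __init__(self, in_dim: int, out_dim: int, n_freq: int = 4):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, n_freq + 1).float())
        # Sine/cosine coefficients per edge: (out_dim, in_dim, n_freq).
        self.a = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_freq))
        self.b = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_freq))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x[:, None, :, None] * self.freqs         # (batch, 1, in_dim, n_freq)
        phi = self.a * torch.sin(t) + self.b * torch.cos(t)
        return phi.sum(dim=(-1, -2))                 # sum basis, then incoming edges

model = nn.Sequential(ToyKANLayer(2, 5), ToyKANLayer(5, 1))
print(model(torch.randn(8, 2)).shape)                # torch.Size([8, 1])
```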
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation - The article introduces StarCoder2-15B-Instruct-v0.1, the first entirely self-aligned code LLM trained with a permissive and transparent pipeline, enabling fine-tuning without human annotations or data from proprietary LLMs. Achieving a HumanEval score of 72.6, it surpasses CodeLlama-70B-Instruct, demonstrating superior effectiveness in learning from its own distribution compared to data distilled from GPT-4.
Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations - Seismic is a novel inverted-index organization for fast and effective approximate retrieval over learned sparse embeddings. By organizing inverted lists into geometrically cohesive blocks, each equipped with a summary vector, the method achieves sub-millisecond per-query latency on MS MARCO embeddings while outperforming state-of-the-art solutions and winning submissions to the BigANN Challenge at NeurIPS 2023 by a significant margin.
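The block-skipping idea can be made concrete in a few lines: a summary upper-bounds each block's best score, so whole blocks that cannot beat the current best are skipped. The clustering stand-in and coordinate-wise-max summaries below are illustrative simplifications, not the paper's actual index.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy nonnegative sparse embeddings (so a coordinate-wise max is a valid bound).
docs = rng.random((1000, 64)) * (rng.random((1000, 64)) < 0.05)
query = rng.random(64) * (rng.random(64) < 0.1)

block_size = 50
order = np.argsort(-docs.sum(1))  # crude stand-in for the paper's geometric clustering
blocks = [order[i:i + block_size] for i in range(0, len(order), block_size)]
summaries = [docs[b].max(0) for b in blocks]  # upper-bounds any doc score in the block

best, best_doc, scored = -1.0, -1, 0
for block, summary in zip(blocks, summaries):
    if float(summary @ query) <= best:  # bound can't beat current best: skip block
        continue
    scores = docs[block] @ query
    scored += len(block)
    i = int(scores.argmax())
    if scores[i] > best:
        best, best_doc = float(scores[i]), int(block[i])

print(f"top doc {best_doc}, score {best:.3f}; scored {scored}/{len(docs)} docs")
```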
A Survey on Vision Mamba: Models, Applications and Challenges - Mamba, a selective structured state space model, offers advanced modeling capabilities akin to Transformers without the quadratic computational complexity, making it suitable for long sequence modeling tasks. Due to its potential as a visual foundation model, researchers are actively exploring its applications in various computer vision tasks, leading to numerous emerging works. This paper provides a comprehensive review of visual Mamba approaches, delineating its formulation, exploring representative backbone networks, categorizing related works across different modalities, and discussing challenges and future research directions in this rapidly evolving area.
Advancing AI’s Cognitive Horizons: 8 Significant Research Papers on LLM Reasoning
CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation - This technical report introduces Mamba models trained using contrastive language-image pretraining (CLIP), showcasing their competitive performance in zero-shot classification tasks with parameter efficiency compared to Vision Transformer (ViT) models. While Mamba-based models demonstrate exceptional performance in out-of-distribution (OOD) generalization, a Hessian analysis suggests they have a sharper and more non-convex landscape, posing challenges for training compared to ViT-based models.
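The pretraining objective behind these models is the standard CLIP symmetric contrastive (InfoNCE) loss; here is a generic sketch of it (not the report's code):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over matched image-text pairs, as in CLIP pretraining."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))           # diagonal entries are the matches
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```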
Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey - This article highlights the growing interest in Vision Transformers (ViTs) as a viable alternative to Convolutional Neural Networks for various vision applications, but notes their deployment challenges due to large model sizes and computational demands. It emphasizes the need for algorithm-hardware co-design to optimize ViT performance, particularly through model quantization, which reduces computational and memory requirements, facilitating the development of specialized hardware accelerators for efficient deployment on resource-constrained devices.
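As a baseline for what "model quantization" does at its simplest, here is a minimal symmetric per-tensor int8 post-training quantizer; the ViT-specific methods the survey covers add per-channel scales, calibration data, and quantization-aware training on top of this.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 post-training quantization of a weight tensor."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(768, 768)                 # e.g. a ViT attention projection weight
q, s = quantize_int8(w)
err = (w - q.float() * s).abs().mean()    # dequantize and measure the rounding error
print(f"mean abs quantization error: {err:.5f}")
```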
Boosting Segment Anything Model with Adversarial Tuning - This paper presents ASAM, a novel methodology that enhances the performance of the Segment Anything Model (SAM) through adversarial tuning, inspired by successful strategies in natural language processing. By utilizing natural adversarial examples and a stable diffusion model, ASAM achieves significant improvements in various segmentation tasks, setting new benchmarks in the field of computer vision without requiring additional data or architectural modifications to the original SAM.
SUNDAE: Spectrally Pruned Gaussian Fields with Neural Compensation - This paper introduces SUNDAE, a memory-efficient Gaussian field representation for 3D rendering, addressing the high memory consumption associated with existing methods. SUNDAE utilizes spectral pruning and neural compensation techniques to reduce memory usage while preserving rendering quality, achieving improved performance compared to vanilla Gaussian splatting algorithms on datasets such as Mip-NeRF360.
Better & Faster Large Language Models via Multi-token Prediction - This study proposes training large language models to predict multiple future tokens at once, improving sample efficiency and downstream capabilities without increasing training time. Using multiple output heads for multi-token prediction, the approach delivers significant gains across tasks, particularly on generative benchmarks like coding, where the models consistently outperform strong baselines. As an added benefit, models trained with 4-token prediction decode up to 3 times faster at inference, even with large batch sizes.
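A minimal sketch of the idea: one shared trunk feeds n output heads, with head k trained to predict the token k positions ahead. A toy GRU stands in for the transformer trunk here, so this illustrates the loss structure rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenModel(nn.Module):
    """Shared trunk + n output heads; head k predicts the token k steps ahead."""
    def __init__(self, vocab: int, dim: int, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.trunk = nn.GRU(dim, dim, batch_first=True)  # toy stand-in trunk
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_heads))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.trunk(self.embed(tokens))
        loss = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(h[:, :-k])       # positions that have a k-ahead target
            targets = tokens[:, k:]
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return loss / len(self.heads)      # average cross-entropy over all heads

model = MultiTokenModel(vocab=100, dim=32)
print(model(torch.randint(0, 100, (2, 16))))  # scalar training loss
```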
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models - LLMs have advanced beyond our capacity to accurately evaluate their quality, prompting the use of LLMs themselves as judges for evaluation. However, this approach, commonly relying on a single large model like GPT-4, is costly and prone to intra-model bias, leading to the proposal of using a Panel of LLM evaluators (PoLL) composed of a larger number of smaller models. Experimentation across various settings and datasets reveals that PoLL outperforms single large judges, reduces intra-model bias, and is significantly less expensive.
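The pooling step itself is simple; in the sketch below, two toy heuristic "judges" stand in for real judge models drawn from different families, and mean pooling is just one of several reasonable voting functions.

```python
import statistics

def judge_length(q: str, a: str) -> int:   # toy judge: rewards concise answers
    return 5 if len(a) < 200 else 3

def judge_overlap(q: str, a: str) -> int:  # toy judge: rewards topical overlap
    shared = set(q.lower().split()) & set(a.lower().split())
    return min(5, 1 + len(shared))

def poll_score(question: str, answer: str,
               judges=(judge_length, judge_overlap)) -> float:
    # Pooling verdicts from a panel of smaller judges is what reduces the
    # intra-model bias a single large judge exhibits.
    return statistics.mean(j(question, answer) for j in judges)

print(poll_score("Why is the sky blue?", "Rayleigh scattering makes the sky blue."))
```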
Iterative Reasoning Preference Optimization - Iterative preference optimization methods, effective for general instruction tuning, have shown limited gains on reasoning tasks. To address this, the authors introduce an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates, rewarding winning over losing reasoning steps; this yields significant accuracy improvements for Llama-2-70B-Chat across reasoning datasets like GSM8K, MATH, and ARC-Challenge, outperforming other Llama-2-based models without additional datasets.
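In loss terms, each iteration trains on a DPO objective over winning vs. losing CoT pairs plus a negative log-likelihood term on the winner. The sketch below assumes precomputed sequence log-probabilities; the alpha weighting and the omitted length normalization are simplifications of the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, alpha=1.0):
    """Per-pair objective: DPO over winner/loser CoT candidates + NLL on the winner.
    Inputs are summed log-probs of each full chain-of-thought + answer under the
    policy (logp_*) and the frozen reference model (ref_logp_*)."""
    dpo = -F.logsigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
    nll = -logp_w  # keep probability mass on the winning reasoning trace
    return (dpo + alpha * nll).mean()

# Toy usage with made-up sequence log-probabilities:
lw, ll = torch.tensor([-12.0]), torch.tensor([-20.0])
rw, rl = torch.tensor([-13.0]), torch.tensor([-19.0])
print(rpo_loss(lw, ll, rw, rl))
```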
Capabilities of Gemini Models in Medicine - Med-Gemini, a specialized family of multimodal models tailored for medical applications, demonstrates superior performance across a range of benchmarks, surpassing GPT-4 models and achieving new state-of-the-art results in medical text, multimodal tasks, and long-context reasoning. With advancements in areas such as medical text summarization, referral letter generation, and multimodal medical dialogue, Med-Gemini shows promising potential for various applications in medicine, although rigorous evaluation remains essential before real-world deployment.
A Careful Examination of Large Language Model Performance on Grade School Arithmetic - The Grade School Math 1000 (GSM1k) benchmark is introduced to scrutinize LLM performance on elementary mathematical reasoning while mitigating the dataset-contamination concerns that affect existing benchmarks like GSM8k. Evaluating leading LLMs on GSM1k reveals accuracy drops of up to 13%, with some model families showing systematic overfitting, while others, notably Gemini, GPT, and Claude, show minimal signs of it; further analysis finds that a model's tendency to generate examples from GSM8k correlates with its GSM8k-to-GSM1k performance gap, suggesting partial memorization of GSM8k in certain models.
ChatNT: A Multimodal Conversational Agent for DNA, RNA, and Protein Tasks - In this paper, ChatNT, a multimodal conversational agent, is introduced, bridging the gap between biological foundation models and conversational agents by demonstrating advanced understanding of biological sequences and achieving new state-of-the-art results on the Nucleotide Transformer benchmark. ChatNT's capability extends to solving multiple tasks simultaneously in English, generalizing to unseen questions, and achieving performance on par with specialized methods on curated biological tasks, presenting a significant advancement in the integration of conversational agents in biology.
OpenGVLab Introduces InternVL-Chat-V1.5 - InternVL-Chat-V1.5 is a multimodal language model that leverages advanced visual encoding techniques for understanding and interacting across languages, focusing on OCR and Chinese language processing tasks.
Llama3-OpenBioLLM-8B: AI Model Paving New Pathways in Bioinformatics - Llama3-OpenBioLLM-8B is an 8-billion-parameter language model built specifically for bioinformatics tasks, providing advanced capabilities for analyzing biological and medical data.
Medical Language Model Leaderboard – Open Medical-LLM Leaderboard - The Open Medical-LLM Leaderboard is a Hugging Face Space that showcases and compares the performance of language models designed for medical applications, facilitating the tracking of advancements in this field.
ByteDance Introduces Hyper-SD, an Innovative Diffusion Acceleration Model - ByteDance's Hyper-SD is an advanced diffusion model that employs acceleration techniques for enhanced image generation quality, compatible with various SDXL and SD1.5 model versions.
Luminal – A High-Performance Deep Learning Library - Luminal is a deep learning library designed to compile complex code to maximize performance across various hardware platforms.
Torchtitan: A PyTorch Library for Large Model Training - Torchtitan is a new PyTorch library focused on efficiently training large-scale language models using cutting-edge distributed techniques.
OpenLIT: Open-Source GenAI and LLM Observability - OpenLIT is an open-source, OpenTelemetry-native observability and application performance monitoring (APM) platform for GenAI and LLM applications, combining traces and metrics in a single tool.
Memary: Enhanced Long-Term Memory for Autonomous Agents - Memary is an open-source system that provides long-term memory for autonomous agents, utilizing knowledge graphs for storing and processing information.
Open Interpreter 01: A New Open Source Language Model Computer - The OpenInterpreter 01 project is an open-source operating system for conversational devices, inspired by GNU/Linux, aimed at full autonomy and modularity in voice interactions. It can power conversational devices reminiscent of the Rabbit R1, Humane Pin, or the Star Trek computer.
Borgo: A New Programming Language Compiling to Go - Borgo is a statically typed programming language that blends Go's simplicity with Rust's expressiveness, compiling to Go and featuring capabilities like algebraic data types and error handling.
NVIDIA Introduces cuda-checkpoint for Managing CUDA State - NVIDIA's cuda-checkpoint is a tool that enables checkpointing and restoring CUDA state in Linux processes, working with CRIU for full CUDA application checkpointing functionality.
Meta-Prompting: A New Technique to Enhance Language Models - Meta-Prompting is a scaffolding technique that transforms a single language model into a multi-tasking "conductor", enhancing its effectiveness in task resolution.
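Conceptually the conductor loop is small; the toy sketch below uses a stub llm() function (a hypothetical stand-in, not the paper's prompts or code) to show the conductor-expert-synthesis round-trip.

```python
def llm(prompt: str) -> str:
    """Stub standing in for a chat-completion call to your model of choice."""
    return f"[model reply to: {prompt.splitlines()[0][:48]}...]"

def meta_prompt(task: str, n_experts: int = 2) -> str:
    # The conductor drafts expert sub-questions, each answered by a fresh
    # "expert" instance of the same model, then synthesizes a final reply.
    plan = llm(f"Break this task into {n_experts} expert sub-questions:\n{task}")
    answers = [llm(f"You are an expert. Answer concisely:\n{line}")
               for line in plan.splitlines() if line.strip()]
    return llm("Synthesize a final answer from these expert replies:\n"
               + "\n".join(answers))

print(meta_prompt("Prove that the sum of two even numbers is even."))
```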
QUICK: Enhancing LLM Inference with Advanced CUDA Kernels - QUICK from SqueezeBits introduces advanced CUDA kernels that boost inference performance for quantized large language models by eliminating shared-memory bank conflicts and enhancing throughput.
Phospho: Text Analytics Platform for LLM Apps - Phospho is a text analytics platform that enables issue detection and insight extraction from text messages, integrated with LLMs like OpenAI and MistralAI.
FlowTest: Open Source IDE for API-First Workflows - FlowTest is an AI-powered Integrated Development Environment (IDE) for crafting, visualizing, and managing API-first workflows, featuring a fast, lightweight architecture that runs entirely locally.
XTuner: A Toolkit for Efficiently Fine-Tuning Large Models - XTuner is a comprehensive toolkit for fine-tuning large language models, enabling fast and flexible performance optimization using various techniques including DeepSpeed and QLoRA.
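XTuner drives such fine-tunes from config files; as an illustration of the underlying QLoRA recipe it wraps (shown here with the Hugging Face peft and bitsandbytes stack, not XTuner's own API, and with the model name and target modules as placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base weights + trainable low-rank adapters = QLoRA.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder: any causal LM
    quantization_config=bnb)
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "v_proj"])  # names are model-dependent
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # only the adapters are trainable
```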
DeepFaceLive: Real-Time Face Swap Tool for PC Streaming - DeepFaceLive is an application that enables real-time face swapping during live streams or video calls, using pre-trained face models or user-customized models.
LLM Datasets: Tools and Resources for Fine-Tuning Large Language Models - The LLM Datasets repository offers datasets, tools, and concepts for tuning and enhancing the performance of large language models, supporting their development across various applications.
RALM_Survey: A Review of Retrieval-Augmented Language Models - RALM_Survey is a repository that reviews the latest technologies and research on retrieval-augmented language models in natural language processing.
Long-Context Data Engineering: Data Engineering for Scaling Language Models to 128K Context - The Long-Context Data Engineering repository showcases implementations designed to scale language models up to 128K context lengths, enhancing their processing capabilities.
ExecuTorch Alpha: Advancing LLM Deployment on Mobile Devices - ExecuTorch Alpha, a new release from PyTorch, enables deployment of large language models on mobile devices with enhanced quantization features and collaborations with companies like Apple and Qualcomm.
Apple to Unveil AI-Enabled Safari Browser Alongside New Operating Systems - Apple is introducing an AI-enabled Safari browser featuring intelligent search and content customization, set to debut with the upcoming releases of iOS 18 and macOS 15.
GitHub Copilot Workspace: A New Development Environment Tailored for Coders - GitHub Copilot Workspace is a new development platform that enables developers to craft complex code changes through interactive specification and action-plan modeling in natural language.
Amazon Q – A New Era of AI Assistants for Businesses and Developers - Amazon Q, newly announced by AWS, aims to accelerate software development and optimize the use of internal business data, offering advanced capabilities for code generation, debugging, and planning.
AI21 Labs Introduces Jamba-Instruct, a New AI Model for Enterprises - Jamba-Instruct is an instruction-tuned version of the Jamba model from AI21 Labs, designed for reliable commercial use with enhanced enterprise features.
Evaluating the Helpfulness of AI-Enhanced Catalogue Data at Amazon - Amazon assesses the effectiveness of AI-enhanced catalogue data, employing advanced machine learning models to enhance the accuracy and consistency of product information.
ChatGPT Plus Introduces Memory in Chatbot Subscription - ChatGPT Plus subscribers now get a Memory feature that lets the chatbot remember users' preferences and past interactions, aiming to create a more personalized assistant.
"Wake Up, Europe!" – A Call for Reducing Bureaucracy and Fostering Innovation - Andreas Klinger advocates for reforming Europe's regulatory framework to facilitate investments and startup growth, emphasizing the importance of a unified jurisdiction and early language education.
How Field AI Is Conquering Unstructured Autonomy - Field AI, founded by a former NASA team leader, is deploying robots capable of operating in unknown, unstructured environments without the need for maps or human oversight.
A.I. Start-Ups Face a Rough Financial Reality Check - AI startups are encountering significant financial difficulties, forcing even prominent companies to restructure and cut costs amid huge expenses and uncertain revenue streams.
Major U.S. Newspapers Sue OpenAI and Microsoft for Copyright Infringement - Eight leading U.S. newspapers owned by investment giant Alden Global Capital have filed a lawsuit against OpenAI and Microsoft, accusing them of copyright infringement by using their articles to train AI models.
Microsoft Bans U.S. Police Departments from Using Enterprise AI Tool for Facial Recognition - Microsoft has updated the terms of service for Azure OpenAI Service, explicitly prohibiting its use by U.S. police departments for facial recognition.
Revenue Models for AI Apps in Startups - The article examines the revenue models of AI applications, noting a lack of pricing innovation and the popularity of subscription models among AI startups.
Enjoy!
Warsaw.AI News Team
P.S.: This newsletter is free, but if you enjoy it, you are welcome to donate to help us make it even better (readers who pay taxes in Poland can also contribute 1.5% of their tax to support us).