Warsaw.AI News 10-16.06.2024
Hello AI Enthusiasts!
Please check out the AI news we found for you during the week of 10-16.06.2024:
Andrej Karpathy’s tutorial on how to reproduce GPT-2.
Generative AI for Beginners: Course by Microsoft - Microsoft provides a "Generative AI for Beginners" course with 18 lessons covering the basics of building generative AI applications, featuring code examples in Python and TypeScript, and access to Azure OpenAI and OpenAI API services.
What If We Recaption Billions of Web Images with LLaMA-3? - Web-crawled image-text pairs are inherently noisy, but prior studies show that semantically aligning and enriching textual descriptions can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. This paper introduces a simple recaptioning pipeline using the open-sourced LLaMA-3 to fine-tune LLaVA-1.5 and recaption 1.3 billion images, resulting in the Recap-DataComp-1B dataset, which demonstrates substantial benefits in training advanced vision-language models and improving zero-shot performance in cross-modal retrieval tasks for discriminative models and text alignment for generative models.
Hearing Anything Anywhere - Recent advancements in 3D computer vision and graphics have enabled the virtualization of real-world 3D environments for Mixed Reality applications, but immersive auditory experiences are equally essential. This paper introduces DIFFRIR, a differentiable RIR rendering framework that reconstructs spatial acoustic characteristics using a sparse set of room impulse response recordings and a planar scene reconstruction, outperforming state-of-the-art baselines and learning interpretable acoustic parameters for novel auditory experiences.
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling - Efficiently modeling sequences with infinite context length has been challenging due to either high computation complexity or limited length generalization. The new hybrid architecture, Samba, which combines Mamba (a selective State Space Model) with Sliding Window Attention, addresses these issues by compressing sequences into recurrent hidden states while maintaining precise memory recall, achieving significant performance improvements and higher throughput compared to state-of-the-art models.
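The sliding-window half of this hybrid is easy to picture: each token attends only to a fixed-size window of its predecessors, so attention cost grows linearly with sequence length. A minimal NumPy sketch of such a mask (the Mamba/SSM half of Samba is omitted, and the function name is ours, not the paper's):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where token i may attend only to the `window`
    most recent tokens, i.e. positions in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Each row has at most `window` True entries, so attention cost is
# O(seq_len * window) rather than O(seq_len ** 2).
mask = sliding_window_mask(6, 3)
```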
Improving Alignment and Robustness with Circuit Breakers - AI systems are prone to harmful actions and adversarial attacks, but the new approach of using "circuit breakers" directly interrupts harmful outputs by controlling the responsible representations. Unlike refusal and adversarial training, this method can be applied to both text-only and multimodal models, effectively preventing harmful outputs and withstanding powerful attacks, demonstrating significant reductions in harmful actions by AI agents under attack.
Compute Better Spent: Replacing Dense Layers with Structured Matrices
Dense linear layers are a major computational bottleneck in foundation models, but structured matrices offer a more compute-efficient alternative. This work introduces Block Tensor-Train (BTT) matrices, demonstrating that they outperform dense matrices in multiple tasks, achieving better training loss on CIFAR-10/100 and matching dense performance on ImageNet-1k with significantly less compute, as well as proving more efficient for small GPT-2 language models.
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models - Artificial intelligence has revolutionized medical applications, particularly with Medical Large Vision Language Models (Med-LVLMs), promising advances in automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, and this paper introduces CARES to evaluate their trustworthiness across five dimensions—trustfulness, fairness, safety, privacy, and robustness—revealing significant concerns, including factual inaccuracies and demographic biases, which pose risks for future deployment.
Generalizable Human Gaussians from Single-View Image - This work addresses the challenge of learning generalizable 3D human models from a single image, focusing on recovering detailed geometry and appearance for unobserved regions. The proposed Human Gaussian model (HGM) uses a diffusion-guided framework and incorporates geometric priors from the SMPL model, significantly surpassing state-of-the-art methods in PSNR and SSIM while exhibiting strong generalization performance for in-the-wild images.
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis - Volumetric rendering methods like NeRF excel in HDR view synthesis from RAW images but are hindered by long training times and the inability to perform real-time rendering. LE3D (Lighting Every darkness with 3DGS) addresses these issues by enhancing structure-from-motion estimation, using a Color MLP for RAW linear color space, and improving scene structure accuracy, enabling real-time novel view synthesis and significantly reducing training time and improving rendering speed compared to previous methods.
Generalizing to Unseen Domains in Diabetic Retinopathy with Disentangled Representations - Diabetic Retinopathy (DR) poses a significant risk of visual impairment, and accurate grading is essential for effective treatment. Existing models struggle with performance degradation on unseen domains due to domain shifts, which encompass more than just image styles and include biases like ethnicity and age. This work proposes a novel framework that decouples representations of paired data into semantic features and domain noise, creating augmented representations that align with real-world clinical needs. Using class and domain prototypes to interpolate these representations and a robust pixel-level semantic alignment loss, the method enhances robustness and effectiveness across diverse domains, as demonstrated by experimental results on multiple benchmarks.
TextGrad: Automatic "Differentiation" via Text - AI is undergoing a paradigm shift with breakthroughs achieved by orchestrating multiple large language models (LLMs) and complex components, highlighting the need for principled and automated optimization methods for compound AI systems. TextGrad, inspired by backpropagation in neural networks, addresses this by using LLMs to provide natural language feedback for optimizing variables in computation graphs, demonstrating significant improvements across diverse applications such as question answering, molecule optimization, and radiotherapy treatment planning.
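The core loop is easy to sketch: an LLM critic produces natural-language feedback on a variable, and an LLM optimizer edits the variable accordingly. The sketch below replaces both LLM calls with deterministic stubs (`textual_feedback` and `apply_feedback` are hypothetical stand-ins, not the TextGrad API) purely to show the shape of the optimization loop:

```python
def textual_feedback(variable: str, target_len: int) -> str:
    """Critic step. TextGrad queries an LLM for natural-language
    feedback here; this deterministic stub stands in for it."""
    if len(variable.split()) > target_len:
        return "too long: remove the last word"
    return "ok"

def apply_feedback(variable: str, feedback: str) -> str:
    """Optimizer step: edit the variable according to the feedback
    (again, an LLM call in the real system)."""
    if feedback.startswith("too long"):
        return " ".join(variable.split()[:-1])
    return variable

prompt = "Answer the question briefly and clearly with supporting evidence"
for _ in range(10):  # "gradient descent" over text
    feedback = textual_feedback(prompt, target_len=4)
    if feedback == "ok":
        break
    prompt = apply_feedback(prompt, feedback)
```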
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising - Diffusion models are renowned for their generative abilities but suffer from high cumulative latency due to their multi-step sequential denoising, limiting parallel computation. AsyncDiff addresses this by enabling model parallelism across multiple devices, transforming sequential denoising into an asynchronous process, significantly reducing inference latency while maintaining generative quality, achieving a 2.7x speedup with minimal impact on quality and a 4.0x speedup with a slight reduction in CLIP Score on Stable Diffusion v2.1.
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models - Autoregressive Large Language Models (LLMs) excel in language tasks but struggle with quadratic complexity in attention modules and limited efficiency due to sequential processing. This study explores the integration of linear attention methods with speculative decoding, introducing an augmentation technique for linear attention to enhance training and serving efficiency. Experiments with seven linear attention models and five LLMs show significant improvements, achieving up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2× speedup during generation.
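The efficiency argument rests on a simple identity: with a kernel feature map, causal attention becomes a running sum that costs the same per token no matter how long the sequence is. A hedged NumPy sketch (using a ReLU feature map as one common choice; the paper's exact formulation may differ):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention as a running recurrence: the state S
    accumulates outer(phi(k_j), v_j), so each step costs O(d^2)
    regardless of sequence length."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # ReLU feature map (one choice)
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))  # sum over j<=i of outer(phi(k_j), v_j)
    z = np.zeros(d)                # sum over j<=i of phi(k_j), for normalization
    out = np.zeros_like(V)
    for i in range(n):
        S += np.outer(phi(K[i]), V[i])
        z += phi(K[i])
        out[i] = (phi(Q[i]) @ S) / (phi(Q[i]) @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
out = linear_attention(Q, K, V)
```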
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning - Husky is an open-source language agent designed for diverse, complex tasks like numerical, tabular, and knowledge-based reasoning. It uses a unified action space to generate and execute actions, outperforming existing agents across 14 datasets and achieving comparable or superior results to advanced models like GPT-4 on mixed-tool reasoning challenges, demonstrated by its new evaluation set, HuskyQA.
Automatic “Differentiation” via Text - TextGrad introduces a novel framework that applies automatic differentiation via textual feedback from LLMs, enhancing components within compound AI systems. Modeled after the transformative impact of backpropagation in neural networks, TextGrad leverages natural language suggestions to optimize various tasks seamlessly, from code generation to medical treatment planning, demonstrating significant performance gains across diverse applications. This approach not only enhances GPT-4o's zero-shot accuracy in Google-Proof Question Answering and LeetCode-Hard coding problems but also advances reasoning prompts and designs for molecular structures and radiation oncology, promising to accelerate the evolution of AI systems.
Transformers meet Neural Algorithmic Reasoners - The integration of Transformers and graph neural networks (GNNs) in TransNAR addresses the fragility of standard Transformers in algorithmic reasoning tasks by leveraging the robustness of GNN-based neural algorithmic reasoners (NARs). This hybrid architecture, achieved through a two-phase training procedure, enables effective cross-attention between Transformer tokens and NAR node embeddings, significantly enhancing performance on algorithmic reasoning tasks like CLRS-Text compared to Transformer-only models.
The Fine-Tuning Index: Performance of Open-Source LLMs - Predibase presents the Fine-Tuning Index, comparing the performance of 700+ open-source LLMs before and after fine-tuning, showing that fine-tuned models outperform GPT-4 in 85% of tasks, offering economical and fast solutions on dedicated GPUs.
Nvidia Conquers Latest AI Tests: GPU Maker Tops New MLPerf Benchmarks - Nvidia dominates the new MLPerf tests, setting records in benchmarks for fine-tuning large language models and graph neural networks with a system comprising 11,616 H100 GPUs, demonstrating exceptional scalability and software optimization.
LiveBench: Real-Time AI Model Performance Evaluation Tool
LiveBench is a platform for testing and evaluating AI model performance in real-time, allowing users to monitor, compare, and optimize various models for maximum efficiency and results.
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
The article introduces the "Test of Time" benchmark for assessing large language models' (LLMs) capabilities in temporal reasoning, using synthetic datasets to systematically analyze model performance across various scenarios.
Real-time Whisper WebGPU: Real-time Transcription on Hugging Face
Real-time Whisper WebGPU is a Hugging Face application by Xenova that enables real-time speech transcription using WebGPU technology, providing fast and accurate audio processing directly in the browser.
Introducing Stable Diffusion 3 Medium: The Most Advanced Image Generation Model - Stability AI announces the release of Stable Diffusion 3 Medium, the latest text-to-image model with 2 billion parameters, providing photorealistic images, improved prompt understanding, and optimized performance on consumer GPUs.
FineWeb-Edu Classifier: Educational Content Classifier for Web Pages
The FineWeb-Edu Classifier evaluates the educational value of web pages using annotations generated by the Llama3-70B-Instruct model to filter and curate educational web content, achieving an F1 score of 82% in binary classification.
Inspectus: LLM Analytics Tool - Inspectus is a versatile tool for visualizing and analyzing large language models, running smoothly in Jupyter notebooks with an easy-to-use Python API. It allows for visualization of attention matrices, query and key token heatmaps, and dimension analysis. Inspectus supports various models, providing tools for analyzing model behaviors and visualizing results.
Spreadsheet Is All You Need: NanoGPT Pipeline in a Spreadsheet - "Spreadsheet Is All You Need" is a project presenting the nanoGPT pipeline in a spreadsheet, allowing for visualization and understanding of transformer workings. It includes all transformer components like embedding, layer norm, self-attention, and MLP, fully interactive and configurable. The project is based on Andrej Karpathy's nanoGPT structure.
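The spreadsheet's appeal is that every transformer sub-block reduces to cell-level arithmetic. Layer norm, for instance, is just per-row normalization plus a learned scale and shift; a NumPy version of the same computation (parameter names are ours):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row to zero mean and unit variance, then apply
    the learned scale (gamma) and shift (beta) -- the same arithmetic
    one can trace cell by cell in the spreadsheet."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```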
Marker: Fast and Accurate PDF to Markdown Conversion - Marker is a CLI tool for quickly and accurately converting PDFs to Markdown, supporting a wide range of documents, formatting tables and code blocks, and converting equations to LaTeX. The tool uses a pipeline of deep learning models for OCR and page layout detection, with support for GPU, CPU, and MPS.
Code2Prompt: CLI Tool for Generating LLM Prompts from Codebase - Code2Prompt is a CLI tool that converts your codebase into a single LLM prompt, creating a source tree, prompt templating, and token counting. It allows generating prompts from the entire code directory, integrates with GPT and Claude models, and supports file filtering, token counting, and copying results to the clipboard.
FunClip: Open-source Video Speech Recognition and Clipping Tool - FunClip is an open-source tool for automated video clipping with speech recognition, leveraging LLMs for precise clipping. It integrates Paraformer models for ASR, allowing users to select text segments or speakers, and offers speech recognition and subtitle generation features. FunClip runs locally and can be accessed via a Gradio interface or command line.
YaFSDP: Yet Another Fully Sharded Data Parallel - YaFSDP is a sharded data parallelism framework designed to work with transformer-like neural network architectures. It is up to 20% faster than FSDP, performs better under high memory pressure, and reduces communication and memory operations overhead. YaFSDP supports pre-training models from 7B to 70B parameters on setups from 64 to 256 devices, with examples for causal pre-training and supervised fine-tuning.
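The bookkeeping behind fully sharded data parallelism is simple to illustrate: each device stores only a slice of the flat parameter vector, and the full vector is reassembled (all-gathered) right before it is needed. A toy NumPy sketch of the idea (YaFSDP's real implementation operates on GPU buffers and overlaps communication with compute; none of these function names come from the library):

```python
import numpy as np

def shard_params(params, n_devices):
    """Each 'device' keeps only a contiguous slice of the flat
    parameter vector, cutting per-device memory roughly n-fold."""
    return np.array_split(params, n_devices)

def all_gather(shards):
    """Reassemble the full parameter vector just before use -- the
    step real FSDP implementations overlap with computation."""
    return np.concatenate(shards)

params = np.arange(10.0)
shards = shard_params(params, 4)
```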
Jailbreak LLMs: Dataset of Jailbreak Prompts for Large Language Models - Jailbreak LLMs is a dataset containing 15,140 prompts from platforms like Reddit, Discord, and other sources, including 1,405 jailbreak prompts. This dataset is the largest collection of jailbreak prompts used to assess the risk and performance of large language models in generating potentially harmful content.
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data - DIRECT-3D is a generative model that directly creates 3D content from text prompts in a single forward pass without optimization. It enables fast generation of high-quality 3D objects with accurate geometric details and various textures, significantly enhancing efficiency through a Stable Diffusion-like architecture.
Thread: AI-powered Python Notebook Built in React - Thread is an AI-powered notebook combining OpenAI's code interpreter with a Python notebook interface, enabling cell generation, code editing, automatic error fixing, and natural language interaction. It runs locally and integrates with the OpenAI API, built on React, providing a familiar Jupyter-like editing experience.
Mistral.rs: Blazingly Fast LLM Inference - Mistral.rs is a fast LLM inference platform supporting various devices, quantization, and easy integration with an OpenAI API-compatible HTTP server and Python bindings. It enables efficient model execution on Apple Silicon, CPU, and GPU devices with CUDA and Metal support, offering support for text and vision models.
LlamaGen: Autoregressive Model Beats Diffusion in Image Generation - LlamaGen is a new family of image generation models that apply the original next-token prediction paradigm, achieving state-of-the-art performance without visual inductive biases. The model scales from 100M to 3B parameters, providing high-quality image generation in class-conditional and text-conditional settings.
TORAX: Tokamak Transport Simulation in JAX - TORAX is a tokamak core transport simulator based on JAX, enabling fast and accurate modeling, pulse design, trajectory optimization, and controller design. The library utilizes automatic differentiation and code compilation, supporting applications like sensitivity analysis of simulation results and gradient-based optimization.
AXLearn: An Extensible Deep Learning Library by Apple - AXLearn is a deep learning library based on JAX and XLA, supporting large-scale model development, enabling training with hundreds of billions of parameters, and integrating with other libraries like Flax and Hugging Face.
DiffusionKit: On-Device Inference of Diffusion Models for Apple Silicon - DiffusionKit is a tool that enables converting PyTorch models to Core ML format and performing diffusion model inference directly on Apple Silicon devices, supporting text-to-image generation using MLX and Core ML.
MDLM: Simple and Effective Masked Diffusion Language Models - MDLM is an innovative diffusion language model combining classical masked language modeling techniques with diffusion processes, achieving state-of-the-art perplexity on various datasets and offering efficient sampling and training methods.
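The forward corruption process in masked diffusion is simple: at noise level t, each token is independently replaced by a mask token with probability t, and the model learns to invert this. A toy sketch of that corruption step (MDLM's actual noise schedule and parameterization differ):

```python
import numpy as np

def mask_tokens(tokens, t, mask_id, rng):
    """Forward corruption at noise level t: each token is replaced
    by mask_id independently with probability t. t=0 leaves the
    sequence clean; t=1 masks everything."""
    tokens = np.asarray(tokens)
    corrupt = rng.random(tokens.shape) < t
    return np.where(corrupt, mask_id, tokens)

rng = np.random.default_rng(0)
seq = np.arange(10)
half_masked = mask_tokens(seq, 0.5, -1, rng)
```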
How to Build an AI Data Center - Brian Potter discusses key challenges and strategies for building AI data centers, highlighting the importance of physical infrastructure, energy consumption, and location to meet the growing computational demands of advanced AI models.
Maintaining Large-Scale AI Capacity at Meta - Meta describes how it maintains its vast AI infrastructure, consisting of thousands of GPU clusters supporting advanced training tasks, implementing innovative maintenance and performance optimization approaches to meet the growing computational demands of generative AI.
LLM²: Using AI to Optimize AI Model Training - Sakana AI introduces the LLM² method, using evolutionary algorithms and language models to automatically discover and optimize preference functions, leading to more efficient training algorithms for large language models with minimal human intervention.
Stable Diffusion 3 Medium: Advanced Text-to-Image Generative Model
Stable Diffusion 3 Medium is a multimodal diffusion text-to-image model that significantly improves image quality, typography, and complex prompt understanding, available under a non-commercial research license.
Luma Dream Machine: High-Quality Videos from Text and Images
Luma Dream Machine is a scalable AI model that quickly generates realistic videos from text and images, featuring smooth camera movements, accurate physics, and consistent characters, available for everyone.
Claude's Character: Shaping the Traits of an AI Model - What should an AI's personality be? Anthropic details their approach to character training for the AI model Claude, aiming to instill traits like curiosity, open-mindedness, and ethical thinking to make interactions more engaging and responsible.
Building AI Products - An overview of the key challenges and strategies for developing AI-based products, highlighting the integration of AI technologies with existing systems and the importance of ethics and transparency in AI development.
Apple is Putting ChatGPT in Siri for Free Later This Year - Apple announces a partnership with OpenAI to integrate ChatGPT into Siri in iOS 18 and macOS Sequoia, providing users with free access without requiring an account, ensuring privacy, and offering premium features for subscribers.
Introducing Apple Foundation Models: On-Device and Server Models - Apple introduces Apple Foundation Models, including on-device and server-based language models designed to support users in everyday tasks while ensuring privacy and responsible AI development.
Open Sourcing Unity Catalog: Creating the Industry’s Only Universal Catalog for Data and AI - Databricks open-sources Unity Catalog, the first universal catalog for data and AI governance, supporting multiple data formats and compute engines, promoting interoperability and eliminating data silos.
Generative AI Is Not Going to Build Your Engineering Team for You
While generative AI can support engineering processes, it cannot replace key aspects of team building like soft skills, work culture, and managing human interactions.