Warsaw.AI News 21-27.10.2024
We invite you to check the AI news that we found for you in the week of 21-27.10.2024:
#LLM #Stanford
Building Large Language Models - This lecture provides a concise overview of building a ChatGPT-like model, covering both pretraining (language modeling) and post-training (SFT/RLHF). For each component, it explores common practices in data collection, algorithms, and evaluation methods. This guest lecture was delivered by Yann Dubois in Stanford’s CS229: Machine Learning course, in Summer 2024.
#Biocomputing
“Lab-grown human brain cells drive virtual butterfly in simulation” - Switzerland-based startup FinalSpark’s Neuroplatform allows researchers to code interactions with brain organoids, small neural structures derived from stem cells, enabling these "mini-brains" to respond to stimuli. For demonstration, researchers created a butterfly model in a virtual world, where human clicks prompt neuron responses that guide the butterfly’s movement.
#AIEthics #AIAlignment
“Sabotage evaluations for frontier models” - Researchers highlight that advanced AI models might develop “sabotage capabilities” to undermine human oversight, particularly in AI development. Evaluations of Anthropic’s Claude models suggest current safeguards can manage these risks, though more robust mitigations may become essential as model capabilities advance.
#GenerativeAI #ImageGeneration
“One Step Diffusion via Shortcut Models” - Shortcut models simplify and speed up image generation by conditioning a single network on both noise level and step size, allowing for efficient high-quality sampling in fewer steps. This approach reduces complexity compared to previous methods, enabling a single training phase and flexible sampling budgets at inference time.
#GenerativeAI #ImageSynthesis
“BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities” - BiGR is a new conditional image generation model that unifies generative and discriminative capabilities using compact binary codes for efficient, high-quality image generation and representation. Its unique binary tokenizer, masked modeling, and entropy-ordered sampling enable versatile applications like inpainting and editing, achieving strong performance on generation quality and representation benchmarks.
#Interpretability
“Automatically Interpreting Millions of Features in Large Language Models” - This study introduces an open-source pipeline that uses large language models (LLMs) to automatically generate and evaluate natural language explanations for sparse autoencoder (SAE) features, aiming to make deep neural network activations more interpretable. By proposing five new scoring techniques, including intervention scoring, this work offers efficient methods to assess explanation quality, demonstrating that SAEs provide a more interpretable latent space than individual neuron activations.
#ComputerVision #DiffusionModel #Image-to-video
“CamI2V: Camera-Controlled Image-to-Video Diffusion Model” - This study enhances text-to-video diffusion models by incorporating camera pose as a direct, physics-based control for more accurate and interpretable camera motion in generated videos. Key innovations include "epipolar attention" for cross-frame consistency and "register tokens" to handle rapid camera shifts and dynamic objects, achieving a 25.5% improvement in camera controllability on RealEstate10K with efficient resource use.
#3DReconstruction #GAN
“3D-GANTex: 3D Face Reconstruction with StyleGAN3-based Multi-View Images and 3DDFA based Mesh Generation” - This paper introduces a novel approach for estimating facial geometry and texture from a single image, utilizing StyleGAN to generate multi-view faces and 3D Morphable Models (3DMM) for detailed 3D mesh and texture estimation. The method yields high-quality 3D face meshes with accurately matched textures, even when the face is rotated, addressing the challenges of traditional single-view texture estimation.
#Multimodal
“xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs” - This paper introduces xGen-MM-Vid (BLIP-3-Video), a multimodal language model for video that efficiently captures temporal information across frames. Leveraging a "temporal encoder" and a compact visual token set, BLIP-3-Video achieves video question-answering accuracy similar to much larger models but with significantly fewer tokens, making it more efficient.
#ObjectSegmentation
“SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree” - The Segment Anything Model 2 (SAM 2) is a leading model for object segmentation in images and videos, utilizing a memory module to track objects across frames. SAM2Long enhances SAM 2 by mitigating error accumulation in long-term video segmentation, using a pathway-based tree search to maintain accuracy across frames, resulting in an average 3.0-point improvement on segmentation benchmarks.
#SequentialReasoning #AIAdaptability
“CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing” - The CoPS (Cross-Task Experience Sharing) algorithm enhances sequential reasoning for agent systems by enabling experience selection across different tasks, using a strategy that matches experiences to minimize distribution shift risks. Experimental results demonstrate CoPS’ effectiveness, outperforming existing methods on complex benchmarks like Alfworld and HotPotQA, while also achieving better sample efficiency for resource-limited environments.
#SemanticSceneCompletion #KITTI
“TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight” - TALoS introduces a test-time adaptation strategy for Semantic Scene Completion that uses real-time observations from LiDAR sensors to enhance model accuracy by providing self-supervision for occupancy and emptiness. By aggregating reliable predictions and utilizing a dual optimization scheme, TALoS significantly boosts performance on driving scenarios, as validated on the SemanticKITTI dataset.
#FewShotLearning
“A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration” - This work explores Few-shot Chain-of-Thought (CoT) prompting, demonstrating that integrating coherent reasoning enhances the predictive accuracy and error correction abilities of large language models compared to isolated stepwise reasoning. The findings indicate that transformers are more sensitive to errors in intermediate reasoning steps than to errors in the final outcome, leading to the proposal of an improved CoT approach that incorporates both correct and incorrect reasoning paths during demonstrations.
#ImitationThreshold
“How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold” - This study investigates the imitation phenomenon in text-to-image models, examining the correlation between the frequency of concepts in training datasets and the model's ability to replicate those concepts, termed the "imitation threshold." The findings indicate that a model requires between 200 and 600 instances of a concept to effectively imitate it, providing insights for ensuring compliance with copyright and privacy laws in model development.
#AIInterpretability
“Looking Inward: Language Models Can Learn About Themselves by Introspection” - The paper explores whether LLMs can engage in introspection, defined as acquiring knowledge about their internal states that isn't solely derived from training data. By finetuning models to predict their own behaviors in hypothetical scenarios, the study provides evidence that LLMs can introspect, outperforming other models in self-prediction, though challenges remain for complex tasks and out-of-distribution generalization.
#PointClouds #CNN #U-Net
“Joint Point Cloud Upsampling and Cleaning with Octree-based CNNs” - The study introduces a simple and efficient method for upsampling and cleaning point clouds using an octree-based 3D U-Net (OUNet) with minimal modifications. This approach processes entire point clouds rather than patches, significantly improving inference speed by at least 47 times while achieving state-of-the-art performance across various benchmarks.
#MultilingualAI
“Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages” - The paper introduces Pangea, a multilingual multimodal large language model (MLLM) trained on a diverse dataset, PangeaIns, which includes 6 million instructions in 39 languages, focusing on culturally relevant tasks. The study also presents PangeaBench, an evaluation suite for assessing the model's capabilities across 47 languages, showing that Pangea outperforms existing models and emphasizing the importance of data diversity for developing inclusive MLLMs.
#LLM #RAG
“LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering” - The paper presents LongRAG, a robust RAG system designed to improve Long-Context Question Answering (LCQA) by addressing the challenges of global context comprehension and noise in long documents. LongRAG acts as a plug-and-play paradigm that enhances the effectiveness of existing long-context LLMs, demonstrating significant performance improvements over current state-of-the-art models across various datasets.
#FederatedLearning
“Federated Transformer: Multi-Party Vertical Federated Learning on Practical Fuzzily Linked Data” - The paper introduces the Federated Transformer (FeT), a novel framework designed to enhance multi-party vertical federated learning (VFL) using fuzzy identifiers, addressing performance and privacy challenges inherent in current models. FeT employs a transformer architecture across various parties and integrates differential privacy with secure multi-party computation, achieving up to 46% improvement in accuracy while maintaining strong privacy protections in collaborative settings. You can also check the code.
#VisionLanguageModels #ObjectHallucination
“Mitigating Object Hallucination via Concentric Causal Attention” - This paper examines the phenomenon of object hallucination in Large Vision Language Models (LVLMs), identifying its correlation with Rotary Position Encoding (RoPE). The study finds that LVLMs generate inaccurate textual responses when relevant visual cues are far from instruction tokens due to the long-term decay of RoPE. To address this issue, the authors propose Concentric Causal Attention (CCA), a novel positional alignment strategy that reduces the relative distance between visual and instruction tokens, improving their interaction. CCA significantly enhances the model's perception capabilities and effectively reduces object hallucination compared to existing methods across multiple benchmarks. You can also check this repository.
#SpeechRecognition
“Moonshine: Speech Recognition for Live Transcription and Voice Commands” - Moonshine is a family of speech recognition models utilizing an encoder-decoder transformer architecture with Rotary Position Embedding, enabling efficient live transcription and voice command processing without zero-padding. Compared to OpenAI's Whisper tiny-en, Moonshine Tiny achieves a 5x reduction in compute requirements for 10-second speech transcriptions while maintaining equivalent word error rates, making it ideal for resource-constrained applications. You can also check this repository.
#Anthropic #Claude
Anthropic Introduces AI Computer Use with Claude - Anthropic’s Claude AI now interacts with computers like a human, performing screen-based tasks such as cursor movements and typing. This breakthrough, now in public beta, promises applications across software environments but raises safety considerations, particularly around preventing misuse and ensuring reliability. You can also check an initial exploration of this feature, the code repository and this demonstration.
#Meta #SegmentAnything
Meta Unveils Segment Anything 2.1 and SPIRIT LM for Enhanced AI Capabilities - Meta has introduced Segment Anything 2.1, an advanced model designed to improve object segmentation in images, alongside SPIRIT LM, a language model aimed at enhancing natural language understanding. These innovations are part of Meta's ongoing efforts to push the boundaries of AI technology, offering improved performance and efficiency for scientific and programming applications.
#DeepMind #GenAI
New Generative AI Tools in Music Creation by DeepMind - DeepMind's new generative AI tools, such as MusicFX DJ and Music AI Sandbox, allow users to create and manipulate music in real-time using text prompts. This suite of tools, developed in collaboration with music industry professionals, provides high-quality, production-ready audio and enables creative experimentation with intuitive controls for artists and enthusiasts alike.
#Anthropic #Claude
New Analysis Tool in Claude.ai by Anthropic - Anthropic's Claude now features an analysis tool allowing it to execute JavaScript code, making it capable of handling complex data tasks and producing real-time insights.
#Midjourney #ImageEditing
Midjourney Launches AI Image Editor: A Guide for Use - Midjourney has introduced a new AI-powered image editor designed to enhance and modify images with advanced features. The tool leverages machine learning algorithms to provide users with intuitive editing capabilities, making it a valuable resource for scientists and programmers interested in AI-driven image processing.
#Anthropic
Anthropic Releases AI Tool Capable of Controlling User's Mouse Cursor
Anthropic has publicly released an AI tool designed to take control of a user's mouse cursor, potentially automating tasks and enhancing user interaction with digital interfaces. This development raises questions about user privacy and the ethical implications of AI systems with direct control over hardware components.
#GPU #Cloud
Lambda Labs Introduces High-Performance GPU Cloud Service - Lambda Labs has launched a GPU cloud service designed to provide scalable and efficient computing power for machine learning and AI applications. The service offers a range of GPU options, including NVIDIA's latest models, to cater to the needs of scientists and developers seeking robust computational resources.
#ElevenLabs #VoiceGeneration
Voice Design by ElevenLabs - a feature that generates a custom voice from a text prompt.
#XAI #API
xAI, Elon Musk's AI Startup, Launches an API - xAI, the artificial intelligence startup founded by Elon Musk, has introduced a new API designed to facilitate integration with various applications. This API aims to provide developers with advanced AI capabilities, enhancing the functionality and efficiency of their software solutions.
#LangChain
LangChain's Memory for Agents - LangChain explores how agents can use procedural, semantic, and episodic memory to enhance user experience, with customizable memory options for specific applications. Developers can choose between real-time and background memory updates to optimize performance, supporting more adaptable and responsive agents.
#Training #Deployment
AutoTrain Advanced: faster and easier training and deployments of state-of-the-art machine learning models - a no-code solution that allows you to train machine learning models in just a few clicks.
#PyTorch #Edge
ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of PyTorch models to edge devices.
#LLM #Code
Open Interpreter lets LLMs run code (Python, JavaScript, Shell, and more) locally. You can chat with Open Interpreter through a ChatGPT-like interface in your terminal.
Repopack is a tool that packs your entire repository into a single, AI-friendly file.
#LLM #PyTorch
Meta Lingua is a minimal and fast LLM training and inference library designed for research. Meta Lingua uses easy-to-modify PyTorch components in order to try new architectures, losses, data, etc.
#SLM #Mobile
PocketPal AI is a pocket-sized AI assistant powered by small language models (SLMs) that run directly on your phone. Designed for both iOS and Android, PocketPal AI lets you interact with various SLMs without the need for an internet connection.
#LLM
Composio equips your AI agents & LLMs with 100+ high-quality integrations via function calling.
#Training
Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning. The authors propose a novel data synthesis framework that tailors the data synthesis ability of the teacher toward the student’s learning process. You can also check this article.
#InformationRetrieval
CLaMP 2 is a music information retrieval model compatible with 101 languages.
#Anthropic #API
Anthropic Quickstarts is a collection of projects designed to help developers quickly get started with building applications using the Anthropic API. Each quickstart provides a foundation that you can easily build upon and customize for your specific needs.
#TogetherAI
The Together Cookbook is a collection of code and guides designed to help developers build with open-source models using Together AI.
#LLM #Microsoft
BitNet - an official inference framework for 1-bit LLMs, released by Microsoft.
#StableDiffusion #StabilityAI
StabilityAI released Stable Diffusion 3.5 - This open release includes multiple model variants, including Stable Diffusion 3.5 Large and Stable Diffusion 3.5 Large Turbo. Additionally, Stable Diffusion 3.5 Medium will be released on October 29th.
#VideoGeneration
Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation.
#text-to-video
Allegro - a small and efficient text-to-video generation model.
#multimodal #search
Embed 3 - a multimodal AI search model released by Cohere.
#Llama #Meta
Quantized Llama models with increased speed and a reduced memory footprint.
#Aya #Cohere
Aya Expanse 8B is an auto-regressive language model with highly advanced multilingual capabilities that uses an optimized transformer architecture. Post-training includes supervised finetuning, preference training, and model merging.
#IBM #Granite
IBM Introduces Granite 3.0: High-Performing AI Models Built for Business. New Granite 3.0 8B & 2B models, released under the permissive Apache 2.0 license, show strong performance across many academic and enterprise benchmarks, able to outperform or match similar-sized models.
#OpenAI #Ethics #Legal
OpenAI Accused of Copyright Infringement and Internet Disruption
A former OpenAI employee has raised concerns that the company is violating copyright laws and potentially harming the internet's integrity. The allegations suggest that OpenAI's practices could have significant legal and ethical implications for the tech industry.
#Energy #Sustainability
Big Tech's Shift to Nuclear Power Amid AI Sustainability Concerns
Major technology companies are increasingly investing in nuclear power as a sustainable energy source to address the growing environmental impact of artificial intelligence. This shift highlights the industry's commitment to reducing carbon footprints and ensuring long-term energy security for AI operations.
#arxiv
Arxiver consists of 63,357 arXiv papers, published between January 2023 and October 2023, converted to multi-markdown (.mmd) format. This dataset includes original arXiv article IDs, titles, abstracts, authors, publication dates, URLs and corresponding markdown files. It can be useful for various applications such as semantic search, domain-specific language modeling, question answering and summarization.