Warsaw.AI News 16-22.09.2024
We invite you to check out the AI news we found for you in the week of 16-22.09.2024.
#Gemini #Google #interview
An interview with Jeff Dean about the past, present, and future of AI, especially the long-term potential of multi-modal models like Gemini.
#OpenAI #LLM
[PL only] An interview with Wojciech Zaremba about his personal history and AI.
#LLM #o1 #OpenAI
Terence Tao ran some experiments to investigate o1’s ability to solve mathematical problems. He concludes that it is better than its predecessors and already quite close to the level of a graduate student, but in some aspects (like generating creative strategies) LLMs are still rather weak.
#ChatGPT #AISafety #AIRisks
Hacker tricks ChatGPT into giving out detailed instructions for making homemade bombs - the hacker was able to trick ChatGPT into producing the bomb-making instructions by telling the bot to “play a game,” then using a series of connecting prompts to get the chatbot to create a detailed science-fiction fantasy world where the bot’s safety guidelines would not apply. Tricking a chatbot into escaping its preprogrammed restrictions is known as “jailbreaking.”
#Copilot
Why Copilot is Making Programmers Worse at Programming - AI-driven tools like GitHub's Copilot enhance productivity by generating code and providing solutions, but they risk eroding fundamental programming skills and fostering over-reliance on AI-generated code. This over-dependence can lead to reduced learning opportunities, narrowed creative thinking, and a false sense of expertise, ultimately hindering long-term skill development and problem-solving abilities in developers.
#LangChain #LangGraph
LangChain Academy Introduces Introductory Course on LangGraph
LangChain Academy has launched a new course titled "Intro to LangGraph," aimed at providing foundational knowledge on LangGraph. The course covers essential concepts and practical applications, making it a valuable resource for scientists and programmers interested in this technology.
#RL #LLM
“Training Language Models to Self-Correct via Reinforcement Learning” - Self-correction in LLMs has been largely ineffective, with current methods relying on multiple models or external supervision. SCoRe, a new multi-turn reinforcement learning approach, enhances self-correction by using self-generated data and regularization, improving self-correction in Gemini models by 15.6% and 9.1% on MATH and HumanEval benchmarks, respectively.
#LLM #MultiModal
“NVLM: Open Frontier-Class Multimodal LLMs” - NVLM 1.0 is a frontier-class multimodal LLM that outperforms both proprietary and open-access models in vision-language tasks, while also improving text-only performance after multimodal training. Its novel architecture and tile-tagging design enhance multimodal reasoning, and the model's success highlights the importance of dataset quality and diversity over sheer scale during pretraining.
#CodeGeneration #QwenCoder
“Qwen2.5-Coder Technical Report” - The Qwen2.5-Coder series, featuring the Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B models, is a major upgrade over its predecessor, CodeQwen1.5, delivering state-of-the-art performance in code generation, completion, reasoning, and repair. Built on the Qwen2.5 architecture and pretrained on over 5.5 trillion tokens, this series excels across more than 10 benchmarks and is designed to advance code intelligence research while promoting real-world developer adoption through its permissive licensing. You can also read more about Qwen2.5 models here.
#RecommenderSystem
“beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems” - The paper introduces beeFormer, a framework for training sentence Transformer models using interaction data specific to recommender systems, surpassing traditional collaborative filtering and semantic similarity models in performance. beeFormer allows knowledge transfer between datasets, supporting the development of universal, domain-agnostic models for mining text representations in recommendation systems, with all resources and code publicly available.
#RAG #Trustworthiness #survey
“Trustworthiness in Retrieval-Augmented Generation Systems: A Survey” - The paper introduces a unified framework to evaluate the trustworthiness of Retrieval-Augmented Generation (RAG) systems across six dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. It reviews the existing literature, creates an evaluation benchmark, and highlights potential challenges, aiming to provide insights for improving the reliability of RAG systems in real-world applications.
#3DGeneration #ComputerVision
“Vista3D: Unravel the 3D Darkside of a Single Image” - The paper introduces Vista3D, a framework for rapid and consistent 3D object generation from single images within five minutes, using a two-phase approach: coarse geometry generation with Gaussian Splatting, followed by fine-tuning through Signed Distance Functions (SDF). Vista3D also employs disentangled representations and angular diffusion prior composition to improve both the visible and obscured aspects of 3D objects, maintaining a balance between consistency and diversity.
#LLM #SmallModels #survey
“What is the Role of Small Models in the LLM Era: A Survey” - The paper explores the often overlooked role of Small Models (SMs) in contrast to Large Language Models (LLMs), highlighting their practical significance despite the focus on larger models like GPT-4 and LLaMA-405B. It examines the relationship between LLMs and SMs in terms of collaboration and competition, aiming to promote a more efficient use of computational resources in the AI community.
#LLM
“LLaMA-Omni: Seamless Speech Interaction with Large Language Models” - The paper introduces LLaMA-Omni, a novel architecture for low-latency, high-quality speech interaction with LLMs, eliminating the need for speech transcription by generating both text and speech responses directly from speech instructions. Built on the Llama-3.1-8B-Instruct model and trained on the InstructS2S-200K dataset, LLaMA-Omni achieves faster response times and enhanced content quality compared to prior speech-language models, while requiring just 4 GPUs for training.
#LLM #Hallucinations
“LLMs Will Always Hallucinate, and We Need to Live With This” - This work asserts that hallucinations in Large Language Models (LLMs) are not just sporadic errors but an inevitable aspect of their design, stemming from fundamental mathematical and logical structures. Drawing on computational theory and Gödel's First Incompleteness Theorem, the authors introduce the concept of Structural Hallucination, arguing that every phase of the LLM process is prone to errors, thereby challenging the belief that such hallucinations can be entirely eliminated.
#BERT
“AudioBERT: Audio Knowledge Augmented Language Model” - Recent research indicates that language models pretrained on text-only datasets often lack basic visual knowledge, prompting an investigation into their auditory knowledge. To address this, the authors introduce AuditoryBench, a dataset designed to evaluate auditory knowledge, and propose AudioBERT, a method that augments BERT's auditory knowledge through a retrieval-based approach, resulting in improved performance on the benchmark.
#Mamba
“PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba” - Remote photoplethysmography (rPPG) enables non-contact measurement of physiological signals from facial videos, but existing deep learning approaches often struggle with long-range spatio-temporal dependencies. To overcome these limitations, the authors propose PhysMamba, a Mamba-based framework that employs a Temporal Difference Mamba block for local dynamic enhancement and a dual-stream SlowFast architecture for fusing multi-scale temporal features, demonstrating superior performance across three benchmark datasets.
#3DSegmentation
“FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally” - This study introduces a globally optimal solver for segmenting 3D Gaussian Splatting (3D-GS) from 2D masks, significantly improving upon conventional methods that rely on lengthy iterative optimization. By leveraging linear programming for optimal label assignment and incorporating background bias, the proposed method achieves robust segmentation within 30 seconds, demonstrating remarkable efficiency and effectiveness in various applications, including object removal and inpainting.
#MoE
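FlashSplat's key observation is that, under a linear objective, the globally optimal assignment of 2D mask labels to 3D Gaussians decomposes into an independent weighted vote per Gaussian. A minimal sketch of that voting step, assuming vote weights have already been accumulated from each Gaussian's rendered contribution to the per-view masks (illustrative only, not the paper's code):

```python
from collections import defaultdict

def assign_labels(votes, background_bias=0.0):
    """votes: iterable of (gaussian_id, label, weight) triples collected by
    rendering each Gaussian's contribution against the 2D masks of every view.
    Label 0 is treated as background and receives an extra bias to suppress
    spurious foreground assignments. Returns {gaussian_id: label} that
    maximizes the total collected weight per Gaussian."""
    totals = defaultdict(lambda: defaultdict(float))
    for gid, label, weight in votes:
        totals[gid][label] += weight
    assignment = {}
    for gid, per_label in totals.items():
        per_label[0] += background_bias  # background bias term
        assignment[gid] = max(per_label, key=per_label.get)
    return assignment
```

Because each Gaussian is decided by a single argmax rather than iterative optimization, the whole segmentation runs in one pass, which is what makes the reported 30-second runtime plausible.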
GRIN-MoE - The GRIN (GRadient-INformed MoE training) framework enhances the training of Mixture-of-Experts (MoE) models by incorporating sparse gradient estimation for expert routing, addressing the challenges posed by traditional backpropagation. By applying GRIN to autoregressive language modeling, researchers developed a top-2 16×3.8B MoE model that, with only 6.6B activated parameters, outperforms a 7B dense model and matches a 14B dense model, achieving impressive results on multiple benchmarks.
#HuBERT #SelfSupervised
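The top-2 routing used by the GRIN-MoE model above can be illustrated in a few lines. This is a minimal sketch assuming the per-expert gate scores are already computed; a real router applies a softmax over router logits, and GRIN's contribution is estimating gradients through this discrete selection:

```python
def top2_route(gate_scores):
    """Select the two highest-scoring experts for a token and renormalize
    their gate weights, so only 2 of N experts are activated."""
    ranked = sorted(range(len(gate_scores)),
                    key=gate_scores.__getitem__, reverse=True)
    top = ranked[:2]
    total = gate_scores[top[0]] + gate_scores[top[1]]
    return [(expert, gate_scores[expert] / total) for expert in top]
```

This is how a 16×3.8B model can run with only 6.6B activated parameters: each token touches just two experts plus the shared layers.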
“Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT” - This study presents a novel self-supervised fine-tuning approach for speech representation learning that effectively separates syllabic units from speaker identity in audio data. By introducing speaker perturbation as a data augmentation technique and employing a frame-level training objective, the proposed method improves syllable segmentation and unit quality metrics on Librispeech, outperforming current state-of-the-art methods.
#Nemotron
Nemotron-Mini-4B-Instruct is a model for generating responses for roleplaying, retrieval augmented generation, and function calling. It is a small language model (SLM) optimized through distillation, pruning and quantization for speed and on-device deployment. It is a fine-tuned version of nvidia/Minitron-4B-Base, which was pruned and distilled from Nemotron-4 15B.
#LLM #DeepSeek
DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct. The new model integrates the general and coding abilities of the two previous versions. DeepSeek-V2.5 better aligns with human preferences and has been optimized in various aspects, including writing and instruction following.
#Speech
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. Its repo is here.
#3DReconstruction
Starst3r - Ultra-fast 3D reconstruction and novel view synthesis.
#inference #optimization #OpenAI
optillm is an OpenAI API-compatible optimizing inference proxy that implements several state-of-the-art techniques to improve the accuracy and performance of LLMs. The current focus is on techniques that improve reasoning over coding, logical, and mathematical queries. By spending additional compute at inference time, these techniques can beat frontier models across diverse tasks.
#AWS
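One representative inference-time technique of this kind is self-consistency: sample several answers at non-zero temperature and return the majority vote. A minimal sketch, where `generate` is a hypothetical stand-in for any OpenAI-compatible completion call (this is not optillm's API):

```python
from collections import Counter

def self_consistency(generate, prompt, n=5):
    """Sample n candidate answers and return the most frequent one.
    Trades extra inference-time compute for accuracy."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

The proxy pattern means a client only has to change its base URL; the extra sampling and voting happen server-side, transparently to the application.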
AWS AI Stack – A ready-to-use, full-stack boilerplate project for building serverless AI applications on AWS. A great fit for those seeking a trusted AWS foundation for AI apps, with access to powerful LLMs via Bedrock while keeping your app’s data separate from model providers.
#AIAgents
ioa - An open-source framework for collaborative AI agents, enabling diverse, distributed agents to team up and tackle complex tasks through internet-like connectivity.
#RAG
paper-qa - High accuracy RAG for answering questions from scientific documents with citations.
#ComputerVision
HivisionIDPhotos - a lightweight and efficient AI tool for ID photos.
#LLM
MiniCPM3-4B - An edge-side LLM that surpasses GPT-3.5-Turbo.
#DeepFakes
FaceSwap is a tool that utilizes deep learning to recognize and swap faces in pictures and videos.
#AIAgent
Agent Workflow Memory (AWM) proposes to induce, integrate, and utilize workflows via an agent memory. A workflow is usually a common sub-routine in solving tasks, with example-specific contexts being abstracted out.
#benchmark #LLM
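A crude stand-in for AWM-style workflow induction is to mine action subsequences that recur across solved task traces. A minimal sketch under that simplification (AWM itself induces workflows with an LLM and abstracts out example-specific contexts, which this does not attempt):

```python
from collections import Counter

def induce_workflows(traces, min_len=2, min_count=2):
    """traces: lists of action names from solved tasks. Returns contiguous
    action subsequences appearing in at least min_count traces, as candidate
    reusable workflows for the agent's memory."""
    counts = Counter()
    for trace in traces:
        subseqs = set()  # count each subsequence at most once per trace
        for n in range(min_len, len(trace) + 1):
            for i in range(len(trace) - n + 1):
                subseqs.add(tuple(trace[i:i + n]))
        counts.update(subseqs)
    return [list(seq) for seq, count in counts.items() if count >= min_count]
```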
super-benchmark - A benchmark and resources for evaluating LLM agents on setting up and executing ML/NLP tasks from research repositories in the GitHub wild.
#Llama
llama-parse-cli - An unofficial command-line interface (CLI) for parsing documents using the LlamaIndex Parser.
#FLUX #LoRA
flux-fine-tuner - A Cog training model that creates LoRA-based fine-tunes for the FLUX.1 family of image generation models.
#ComputerVision
supervision - a library of reusable computer vision tools.
#AIAgent
Agent Zero - a general-purpose assistant.
#CodeGeneration
taipy - Turns Data and AI algorithms into production-ready web applications.
#LLM #framework
Opik - an open-source platform for evaluating, testing and monitoring LLM applications.
#NLP #Llama
WordLlama is a fast, lightweight NLP toolkit that handles tasks like fuzzy deduplication, similarity, and ranking with minimal inference-time dependencies, optimized for CPU hardware.
#VideoSummarizer
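To illustrate what fuzzy deduplication means in practice, here is a stdlib-only sketch of the task (this is not WordLlama's API; toolkits like it use embeddings or hashing to scale past this quadratic loop):

```python
import difflib

def fuzzy_dedup(texts, threshold=0.9):
    """Keep each text unless it is near-identical (similarity >= threshold)
    to one already kept."""
    kept = []
    for text in texts:
        if not any(difflib.SequenceMatcher(None, text, seen).ratio() >= threshold
                   for seen in kept):
            kept.append(text)
    return kept
```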
Surveillance Video Summarizer is an AI-driven system that processes surveillance videos, extracts key frames, and generates detailed annotations. Powered by a fine-tuned Florence-2 Vision-Language Model (VLM) specifically trained on the SPHAR dataset, it highlights notable events, actions, and objects within video footage and logs them for easy review and further analysis.
#Salesforce #AIAgent #reasoning
Salesforce Unveils AI Agents with Advanced Reasoning Capabilities
Salesforce has introduced new AI agents, codenamed "Atlas," which are designed to enhance reasoning and decision-making processes. These agents aim to improve customer interactions by leveraging advanced machine learning techniques to understand and respond to complex queries.
#Google #audio
NotebookLM: Enhancing AI-Assisted Note-Taking with Audio Overviews
Google's NotebookLM, an AI-powered note-taking tool, now provides audio overviews. This feature aims to help users quickly grasp the essence of their notes through concise audio summaries, enhancing productivity and comprehension.
#API #ImageGeneration
Luma Labs Introduces Dream Machine API for Enhanced AI Image Generation
Luma Labs has launched the Dream Machine API, designed to facilitate advanced AI-driven image generation. This API leverages state-of-the-art machine learning models to produce high-quality, customizable images for various applications.
#Microsoft #Copilot
Microsoft Introduces M365 Copilot for Enhanced Productivity
Microsoft has unveiled M365 Copilot, an AI-driven tool designed to augment productivity within its suite of applications. The tool leverages advanced machine learning algorithms to assist users in tasks such as drafting emails, generating reports, and analyzing data, thereby streamlining workflows for scientists and programmers.
#O1 #OpenAI #Copilot
OpenAI O1 Model Now Available in GitHub Copilot
GitHub has integrated the OpenAI O1 model into GitHub Copilot, enhancing its code completion capabilities. This update aims to improve the accuracy and efficiency of code suggestions for developers.
#Devin #CodeGeneration
Cognition Labs' Devin AI Receives New Features and Upgrades
Cognition Labs has introduced significant updates to its Devin AI, enhancing its natural language processing capabilities and user interface. These improvements aim to provide more accurate responses and a more intuitive user experience for scientists and programmers, improving code-editing speed and accuracy.
#VideoGeneration
Snap Introduces AI Video Generation Tool for Creators
Snap has launched a new AI-powered video generation tool aimed at content creators. This tool leverages advanced machine learning algorithms to automate video production, enhancing efficiency and creativity in content creation.
#Apple #iOS
Apple Releases iOS 18.1 Public Beta with Enhanced AI Features
The latest iOS 18.1 public beta introduces advanced artificial intelligence capabilities, including improved Siri functionality and enhanced machine learning algorithms. These updates aim to provide a more intuitive user experience and better performance for AI-driven applications.
#DeepMind #VideoEditing
Empowering YouTube Creators with Generative AI
DeepMind introduces a new generative AI tool designed to assist YouTube creators in producing high-quality content more efficiently. This tool leverages advanced machine learning algorithms to automate video editing tasks, thereby enhancing creativity and productivity.
#Meta #LLaMA #CodeGeneration
Meta and Together Launch LLaMACoder for Enhanced AI Coding
Meta and Together have introduced LLaMACoder, a new AI model designed to improve coding efficiency and accuracy. This model leverages advanced machine learning techniques to assist programmers in generating and debugging code more effectively.
#Eureka #Microsoft
Eureka: Evaluating and Understanding Progress in AI
Microsoft Research discusses the development of Eureka, a framework designed to evaluate and understand advancements in artificial intelligence. The framework aims to provide a comprehensive analysis of AI progress, focusing on both technical performance and broader impacts.
#HuggingFace #SQL
Hugging Face Introduces SQL Console for Enhanced Data Interaction
Hugging Face has launched a new SQL Console feature, enabling users to interact with datasets using SQL queries directly within the platform. This tool aims to streamline data analysis and manipulation for scientists and programmers, enhancing productivity and efficiency.
#OpenAI
OpenAI Plans to Transition from Non-Profit Structure in 2024
OpenAI is reportedly planning to move away from its current non-profit structure next year. This shift aims to streamline operations and potentially attract more investment to support its AI research and development.
#AIAgent
AI Copilots and AI Agents: Transforming White-Collar Roles
The article discusses the emergence of AI copilots and AI agents, which are designed to assist in white-collar jobs by automating routine tasks and enhancing productivity. It explores the potential impact on various industries, the technological advancements driving these changes, and the ethical considerations involved.
#Salesforce
Salesforce Unveils Comprehensive AI Strategy to Enhance Customer Experience
Salesforce has announced a new AI strategy aimed at integrating artificial intelligence across its platform to improve customer interactions and operational efficiency. The strategy includes the development of AI-driven tools and features designed to provide personalized experiences and predictive insights for users.
#audio #Fraunhofer
ODAQ (Open Dataset of Audio Quality) - it contains 240 audio samples accompanied by corresponding quality scores.
#TextGeneration
FinePersonas contains detailed personas for creating customized, realistic synthetic data. With this dataset, AI researchers and engineers can easily integrate unique persona traits into text generation systems, enhancing the richness, diversity, and specificity of synthetic outputs without the complexity of crafting detailed attributes from scratch.
#VideoUnderstanding
Neptune - a new open-source video question-answering dataset for testing long-video understanding.