Warsaw.AI News 23-29.09.2024
We invite you to check the AI news that we found for you in the week of 23-29.09.2024:
#IDEAS #NCBR #Sankowski
[PL only] Piotr Sankowski is leaving IDEAS NCBR, but the circumstances in which he was not re-elected as CEO remain controversial. The AI community has prepared a petition asking the Prime Minister of Poland to explain the exact circumstances of this situation.
#AISafety
Global AI Safety: Scientists Call for International Cooperation - Leading AI researchers from the US and China have issued a joint statement urging international collaboration on AI safety regulations, proposing a framework for safety assurance and independent research funding. This unprecedented consensus highlights the urgent need for global governance to mitigate catastrophic risks posed by advanced AI.
#Sutskever
Sutskever's reading list
#OpenAI
OpenAI Launches Global Affairs Academy - OpenAI has introduced the OpenAI Academy, aimed at fostering a deeper understanding of AI's global impact among policymakers, researchers, and industry leaders. The academy will provide educational resources and training to help stakeholders navigate the complexities of AI technology and its societal implications.
#MIT #Medicine
Accelerating particle size distribution estimation - MIT researchers speed up a novel AI-based estimator for medication manufacturing by 60 times.
#MultiModal #ImageEditing
ControlEdit: A MultiModal Local Clothing Image Editing Method - Multimodal clothing image editing allows precise adjustments of clothing images using textual and visual data, improving designer efficiency and simplifying user design. The proposed method, ControlEdit, uses self-supervised learning and multimodal-guided local inpainting to achieve seamless clothing image editing with natural boundary transitions, outperforming baseline algorithms in various evaluations.
#AIagents
The Impact of Element Ordering on LM Agent Performance - Recent research shows that the order in which UI elements are presented significantly affects the performance of language model agents navigating virtual environments. Experiments reveal that randomized element ordering can degrade performance as much as removing visible text, and dimensionality reduction offers a solution in pixel-only environments, leading to major improvements in task completion.
#DeepMind
AlphaProteo generates novel proteins for biology and health research - A new AI system designs proteins that successfully bind to target molecules, with the potential to advance drug design, disease understanding, and more.
#AIAlignment
RRM: Robust Reward Model Training Mitigates Reward Hacking - Reward models (RMs) help align large language models with human preferences, but current training methods struggle to separate prompt-driven preferences from irrelevant factors like response length. To address this, a new causal framework and data augmentation technique were introduced, resulting in a more robust reward model (RRM) that improves model performance and significantly enhances DPO-aligned policies.
#ComputerVision
NoTeeline: Supporting Real-Time Notetaking from Keypoints with Large Language Models - NoTeeline is an interactive system that simplifies real-time, personalized notetaking while watching videos by allowing users to jot down brief micronotes, which are automatically expanded into detailed, stylistically consistent notes. A study found that NoTeeline reduces mental effort, shortens text input by 47%, and cuts notetaking time by 43.9%, while improving note quality and factual accuracy.
#AIEthics #GenerativeAI
No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices - Advances in generative models have enabled AI to produce text, code, and images resembling human content, prompting the use of watermarking to verify the source and prevent misuse. However, this study reveals vulnerabilities in current watermarking methods, showing they are susceptible to removal or spoofing attacks, and offers guidelines to improve robustness and utility in practice.
#efficiency
BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices - Deep neural networks (DNNs) excel in cognitive tasks but are limited by high computational complexity and memory use, making real-time execution on embedded platforms challenging. To address this, the BitQ framework optimizes Block Floating Point (BFP) quantization, improving DNN performance on embedded platforms by finding the best block size and bitwidth allocation, resulting in efficient computation while preserving accuracy.
#ExplainableAI
CaBRNet, an open-source library for developing and evaluating Case-Based Reasoning Models - In explainable AI, self-explainable models offer a principled alternative to post-hoc explanations, but challenges like reproducibility and inconsistent standards persist. CaBRNet addresses these issues by providing an open-source, modular, and backward-compatible framework for Case-Based Reasoning Networks, promoting easier comparison and collaboration.
#ImageGeneration #TransformerModels
MaskBit: Embedding-free Image Generation via Bit Tokens - Masked transformer models are emerging as strong alternatives to diffusion models for class-conditional image generation, typically using a VQGAN for latent-to-image space transitions and a Transformer for generation. This study modernizes VQGAN with improved transparency and performance, while also introducing an embedding-free generation network using bit tokens, achieving a state-of-the-art FID score of 1.52 on ImageNet with a compact 305M parameter model.
#VisionLanguageModels
ComiCap: A VLMs pipeline for dense captioning of Comic Panels - Recent advancements in the comic domain have led to the development of models for single- and multi-page analysis, capable of detecting and linking comic elements like panels, characters, and text. This work introduces a pipeline that utilizes Vision-Language Models (VLMs) to generate dense, grounded captions, employing an attribute-retaining metric for evaluation and yielding captions that surpass those from specialized models, along with a dataset of over 2 million annotated panels across 13,000 books.
#LongContextReasoning
Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries - Michelangelo is a new evaluation framework designed for assessing long-context reasoning in large language models, enabling automatic scoring and deriving insights beyond simple information retrieval. Utilizing the Latent Structure Queries (LSQ) framework, it constructs tasks that require models to filter out irrelevant information to reveal a latent structure, resulting in three diagnostic evaluations that highlight the strengths and weaknesses of state-of-the-art models in synthesizing long-context information.
#LLMs
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench - The ability to plan actions to achieve desired outcomes has long been a focus in AI research, particularly regarding the capabilities of LLMs. This paper evaluates the performance of current LLMs and new Large Reasoning Models, specifically OpenAI's o1, on the PlanBench benchmark, revealing that while o1 shows significant improvements in planning abilities, it still has room for enhancement and raises important considerations around accuracy, efficiency, and deployment guarantees.
#AIReasoning
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning - Test-time aggregation strategies can enhance the reasoning of large language models (LLMs) by generating multiple samples and voting on them, but they often hit a performance saturation point. The proposed MAgICoRe framework addresses key challenges in LLM refinement, such as excessive and insufficient refinement, by categorizing problems as easy or hard, using external reward models for error localization, and employing a multi-agent system for iterative feedback and improvement. It proves effective across five math datasets and outperforms existing methods while using fewer samples.
#3DReconstruction
Dynamic 2D Gaussians: Geometrically accurate radiance fields for dynamic objects - This paper introduces Dynamic 2D Gaussians (D-2DGS), a novel representation designed to reconstruct accurate meshes from sparse image inputs, addressing the limitations of current 4D representations that fail to produce high-quality meshes. By utilizing 2D Gaussians for geometry and controlling points to capture deformation, D-2DGS effectively extracts high-quality dynamic mesh sequences from rendered images and depth maps, demonstrating superior performance in mesh reconstruction.
#ComputerVision
Statewide Visual Geolocalization in the Wild - This work introduces a method for predicting the geolocation of street-view photos by matching them against a database of aerial imagery within a state-sized search region. The approach partitions the region into geographical cells and trains a model to map these cells and photos into a joint embedding space, achieving successful localization of 60.6% of non-panoramic street-view photos from the Mapillary platform to within 50 meters of their actual locations in Massachusetts.
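To make the joint-embedding idea in the geolocalization item above more concrete, here is a minimal sketch of the retrieval step only. It assumes that cell and photo embeddings have already been produced by some trained model; the cell ids, the 3-dimensional vectors, and the helper names `cosine` and `localize` are made up for illustration and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def localize(photo_embedding, cell_embeddings):
    """Return the id of the cell whose embedding is most similar to the photo."""
    return max(
        cell_embeddings,
        key=lambda cell_id: cosine(photo_embedding, cell_embeddings[cell_id]),
    )


# Hypothetical toy data: three geographical cells and one street-view photo,
# all assumed to live in the same (here 3-dimensional) embedding space.
cells = {
    "cell_a": [1.0, 0.0, 0.0],
    "cell_b": [0.0, 1.0, 0.0],
    "cell_c": [0.6, 0.8, 0.0],
}
photo = [0.1, 0.9, 0.0]

best_cell = localize(photo, cells)
print(best_cell)
```

A real system would search a very large number of cells and would likely use an approximate nearest-neighbor index instead of this linear scan.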
#OpenSourceAI #MultiModal
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models - The most advanced multimodal models remain proprietary, but Molmo introduces a new family of state-of-the-art open VLMs built from scratch using a novel human-annotated image caption dataset and diverse fine-tuning data. Molmo's 72B model outperforms other open models and even rivals proprietary systems like GPT-4o and Claude 3.5, with plans to release all model weights, data, and code soon.
#llama
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models - Llama 3.2 introduces small and medium-sized vision LLMs (11B and 90B) and lightweight text-only models (1B and 3B) optimized for edge and mobile devices, excelling in tasks like summarization and instruction following. These models, available for Qualcomm, MediaTek, and Arm processors, offer easy deployment through Llama Stack distributions, supporting on-device, cloud, and single-node environments, with downloads available on llama.com and Hugging Face.
#IBM #NASA
IBM and NASA Release Open-Source AI Model on Hugging Face for Weather and Climate Applications - The new AI foundation model offers insights beyond forecasting for scientists, developers, and businesses to better understand and analyze weather and climate data.
#llama #NVIDIA
Llama-3_1-Nemotron-51B-instruct - A new version of Nemotron released by NVIDIA, which offers a great tradeoff between model accuracy and efficiency. Efficiency (throughput) directly translates to price, providing great ‘quality-per-dollar’. A novel Neural Architecture Search (NAS) approach greatly reduces the model’s memory footprint, enabling larger workloads and allowing the model to fit on a single GPU (H100-80GB) at high workloads. The NAS approach also enables the selection of a desired point in the accuracy-efficiency tradeoff. The model is ready for commercial use.
#Mistral
Mistral-Small-Instruct-2409 - An instruct fine-tuned model with the following characteristics: 22B parameters, a vocabulary of 32768 tokens, function calling support, and a 32k sequence length.
#GenAI
https://github.com/NirDiamant/GenAI_Agents - This repository provides tutorials and implementations for various Generative AI Agent techniques, from basic to advanced. It serves as a comprehensive guide for building intelligent, interactive AI systems.
#TextToSpeech
https://github.com/lamm-mit/PDF2Audio - This code can be used to convert PDFs into audio podcasts, lectures, summaries, and more. It uses OpenAI's GPT models for text generation and text-to-speech conversion. You can also edit a draft transcript (multiple times) and provide specific comments, or overall directives on how it could be adapted or improved.
#Llama #API
https://github.com/meta-llama/llama-stack - This repository contains the Llama Stack API specifications as well as API Providers and Llama Stack Distributions. The Llama Stack defines and standardizes the building blocks needed to bring generative AI applications to market. These blocks span the entire development lifecycle: from model training and fine-tuning, through product evaluation, to building and running AI agents in production.
#RL
https://github.com/google-research/circuit_training - AlphaChip: An open-source framework for generating chip floorplans with distributed deep reinforcement learning.
#MoE
https://github.com/Time-MoE/Time-MoE - Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts.
#LLM
https://github.com/quehry/hellobench - an open-source benchmark designed to evaluate the long text generation capabilities of large language models (LLMs). This repository includes the complete test dataset, evaluation code from the associated paper,
#ComputerVision
https://github.com/Stable-X/StableDelight - StableDelight is a cutting-edge solution for specular reflection removal from textured surfaces. Building upon the success of StableNormal, which focused on enhancing stability in monocular normal estimation, StableDelight takes this concept further by applying it to the challenging task of reflection removal.
#FileOrganizer
https://github.com/QiuYannnn/Local-File-Organizer - An AI-powered file management tool that ensures privacy by organizing local texts, images, and PDFs.
#LangGraph #LangChain
agentic-customer-service-medical-dental-clinic - This software contains an agent built on LangGraph and LangChain that answers general client queries; it can be deployed on any of the clinic's communication channels (WhatsApp, Telegram, Instagram, etc.).
#ComputerVision
https://github.com/feevos/tfcl - Official repository for PTAViT3D and PTAViT3DCA models for field boundaries detection using S2 and/or S1 imagery.
#LLM
https://github.com/unclecode/crawl4ai - simplifies asynchronous web crawling and data extraction, making it accessible for LLMs and AI applications
#Intel
Next-Generation AI Solutions with Intel Xeon and Gaudi 3 - Intel introduces advanced AI solutions featuring the latest Xeon processors and Gaudi 3 accelerators, designed to enhance performance and efficiency in AI workloads. These innovations aim to provide scalable and optimized computing power for complex AI applications, catering to the needs of scientists and programmers.
#SpeechToText
ALD Improvements: Enhancing Automatic Language Detection - AssemblyAI has introduced significant improvements to its Automatic Language Detection (ALD) system, enhancing its accuracy and robustness. These advancements allow for more precise identification of languages in audio files, benefiting various applications in speech recognition and natural language processing.
#ModelFinetuning
Using Synthetic Data to Improve Flux Finetunes - The article discusses using synthetic data to improve fine-tunes of Flux, a text-to-image model. It details the methodology and benefits of integrating synthetic data into model finetuning, emphasizing improvements in accuracy and efficiency.
#LLM
Streaming LLM APIs: A Technical Overview - The article discusses the implementation and benefits of streaming APIs for LLMs. It highlights how streaming can improve response times and user experience by delivering partial results as they are generated.
#NLP #LLM #GenAI
Amazon Launches Amelia, a Generative AI-Powered Assistant for Third-Party Sellers - Amazon has introduced Amelia, an AI-driven assistant designed to support third-party sellers on its platform. The assistant leverages generative AI to streamline seller operations, enhance customer interactions, and optimize business processes.
#Health
AI Integration in Healthcare: Enhancing Efficiency and Accuracy - AI is increasingly being integrated into healthcare systems to improve diagnostic accuracy and operational efficiency. This technology aids in early disease detection, personalized treatment plans, and streamlining administrative tasks, thereby transforming patient care and medical research.
#DeepMind
AlphaChip Revolutionizes Computer Chip Design - DeepMind's AlphaChip leverages AI to optimize computer chip layouts, significantly enhancing performance and efficiency. This breakthrough demonstrates the potential of machine learning in hardware design, reducing design time and improving chip functionality.
#OpenAI
OpenAI Introduces Advanced Voice Mode with Expanded Voice Options and New Interface - OpenAI has launched an enhanced voice mode that includes a broader selection of voices and a redesigned user interface. This update aims to improve user experience and provide more versatility for developers integrating voice functionalities.
#Meta
Ray-Ban and Meta Unveil New Smart Glasses with AI Integration - Ray-Ban and Meta have introduced their latest smart glasses, featuring advanced AI capabilities and enhanced hardware. These glasses aim to seamlessly integrate augmented reality into everyday life, offering users a blend of fashion and cutting-edge technology.
#SAM #ComputerVision #VideoAnnotation
SAM2 for Video: Revolutionizing Video Annotation with AI - The webinar discusses SAM2, an advanced AI tool designed to enhance video annotation processes. It highlights the tool's capabilities in improving accuracy and efficiency for scientists and programmers working with large video datasets.
#ColBERT #PALM #NLP
Document Similarity with ColBERT and PALM - The article discusses the implementation of document similarity using ColBERT and PALM models. It provides a detailed explanation of how these models can be used to measure the similarity between documents, focusing on their architecture and performance metrics.
#NotebookLM #Google #AudioGeneration #VideoGeneration
NotebookLM Expands to Include Audio and Video Sources - Google's NotebookLM, an AI-powered tool designed to assist with note-taking and information synthesis, now supports audio and video sources. This enhancement allows users to integrate multimedia content into their notes, providing a more comprehensive and versatile research tool.
#ECCV #audio #visual
https://github.com/msu-video-group/ECCVW24_Saliency_Prediction - A novel audio-visual mouse saliency (AViMoS) dataset with the following key features: diverse content (movie, sports, live, vertical videos, etc.); large scale (1500 videos with a mean duration of 19 s); high resolution (all streams are FullHD); audio track saved and played to observers; mouse fixations from >5000 observers (>70 per video); license: CC-BY.
#LLama
Meta and Groq Collaborate to Enhance Open-Source Developer Ecosystem with LLaMA 3.2 Launch - Meta and Groq have announced their continued partnership to advance the open-source developer ecosystem, coinciding with the launch of LLaMA 3.2. This collaboration aims to provide developers with enhanced tools and resources, fostering innovation and efficiency in AI and machine learning projects.
#OpenAI
OpenAI CTO Mira Murati Steps Down - Mira Murati, the Chief Technology Officer of OpenAI, has announced her departure from the company. Her exit marks a significant change in the leadership of the company.
#OpenAI
More resignations at OpenAI - OpenAI’s chief research officer, Bob McGrew, and vice president of research, Barret Zoph, left the company on Wednesday, hours after the departure of CTO Mira Murati.