Warsaw.AI News 30.09-6.10.2024
We invite you to check out the AI news we found for you in the week of 30.09-6.10.2024:
#AIRegulation #California
California Governor Vetoes Controversial AI Bill SB 1047
Governor Newsom has vetoed California's AI bill SB 1047, which aimed to regulate artificial intelligence technologies within the state. The bill faced criticism for its potential to stifle innovation and impose excessive regulatory burdens on tech companies.
#Warsaw.AI
We invite you to attend Episode XXIV of Warsaw.AI meetups, this time on 10.10.2024, at 18:00 UTC+2.
#DeepMind #Google #AGI
A Google DeepMind podcast on AGI (Artificial General Intelligence) features interviews with several senior researchers at DeepMind, including CEO Demis Hassabis. It's an insightful dive into the future of AI with top experts in the field.
#DistributedTraining
An article about Distributed Training of Deep Learning models, focusing on the challenges and methodologies involved. It discusses the importance of parallelism, data distribution strategies, and the role of hardware accelerators in optimizing training efficiency.
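For a concrete taste of the simplest of these strategies, here is a minimal sketch (our illustration, not code from the article) of data-parallel training in PyTorch, where each process holds a full model replica and gradients are averaged across GPUs during the backward pass:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Each process is pinned to one GPU; gradients are all-reduced in backward().
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])
# From here, the training loop is unchanged: forward, loss, backward, step.
```

Launched with, e.g., `torchrun --nproc_per_node=4 train.py`; model- and pipeline-parallel strategies require considerably more machinery.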
#RNN #LSTM #GRU
“Were RNNs All We Needed?” - This work revisits traditional RNN models, LSTMs, and GRUs, showing that by removing hidden state dependencies from their gates, they can be parallelized during training and achieve comparable performance to newer models. The minimal versions (minLSTMs and minGRUs) use fewer parameters and are 175x faster for long sequences, matching recent sequence models in performance.
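To make the core idea concrete, here is a minimal sketch of the minGRU recurrence as the paper describes it (module and variable names are ours). Because the gate and candidate state depend only on the input, the recurrence is linear in the hidden state, which is what allows a parallel scan at training time; the sequential loop below is kept only for readability:

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        self.to_z = nn.Linear(dim_in, dim_hidden)        # update gate: depends on x_t only
        self.to_h_tilde = nn.Linear(dim_in, dim_hidden)  # candidate state: x_t only

    def forward(self, x):
        # x: (batch, seq_len, dim_in)
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h_tilde(x)
        h = torch.zeros_like(h_tilde[:, 0])
        outs = []
        for t in range(x.shape[1]):
            # Linear recurrence: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)
```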
#LLM #Multitask #RLHF
“The Perfect Blend: Redefining RLHF with Mixture of Judges” - This work introduces Constrained Generative Policy Optimization (CGPO), a novel post-training method for improving multi-task learning in large language models by addressing challenges like reward hacking and extreme multi-objective optimization. CGPO outperforms standard RLHF algorithms across various tasks, achieving significant gains in chat, STEM, and coding benchmarks, while mitigating reward hacking and providing better generalization.
#Brain
“Largest brain map ever reveals fruit fly’s neurons in exquisite detail” - A fruit fly might not be the smartest organism, but scientists can still learn a lot from its brain. Researchers are hoping to do that now that they have a new map — the most complete for any organism so far — of the brain of a single fruit fly. The wiring diagram, or ‘connectome’, includes nearly 140,000 neurons and captures more than 54.5 million synapses, which are the connections between nerve cells.
#ScalingLaw #LearningRate #LLM #Llama
“Scaling Optimal LR Across Token Horizons” - This work investigates how the optimal learning rate (LR) changes with token horizon in LLM training, revealing that longer training requires smaller LRs. It introduces a scaling law for transferring LR across dataset sizes, offering a rule-of-thumb for estimating the optimal LR without additional overhead, and suggests that improper LR choices, such as in Llama-1, can lead to performance degradation.
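As a hedged illustration of how such a rule-of-thumb could look in practice, assuming the optimal LR decays as a power law in the token horizon (the functional form and the exponent below are our assumptions, not the paper's fitted values):

```python
def transfer_lr(lr_ref, tokens_ref, tokens_new, alpha=0.3):
    """Rescale an LR tuned at a short token horizon to a longer one,
    assuming lr_opt(D) is proportional to D**(-alpha). Illustrative only."""
    return lr_ref * (tokens_new / tokens_ref) ** (-alpha)

# LR tuned on a 10B-token pilot run, transferred to a 1T-token run:
lr_long = transfer_lr(3e-4, 10e9, 1e12)  # noticeably smaller, as the paper predicts
```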
#KnowledgeGraph
“Knowledge Graph Embedding by Normalizing Flows” - This work introduces a novel approach to knowledge graph embedding (KGE) by incorporating uncertainty through group theory, embedding entities and relations as elements of a symmetric group. This method unifies existing KGE models, enhances expressiveness using complex random variables, and ensures computational efficiency, leading to better performance in learning logical rules and handling uncertainty.
#text-to-image #LLM
“ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation” - This work introduces the task of prompt-adaptive workflow generation for text-to-image models, where workflows are automatically tailored to each user prompt. It proposes two LLM-based approaches—one tuning-based and one training-free—that significantly improve image quality compared to monolithic models or prompt-independent workflows, demonstrating a new path to enhancing text-to-image generation.
#LLM
“MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning” - MM1.5 is a new family of multimodal large language models (MLLMs) optimized for text-rich image understanding, visual grounding, and multi-image reasoning. It employs a data-centric approach, with models ranging from 1B to 30B parameters, and introduces specialized variants for video and mobile UI understanding, showcasing the importance of diverse data mixtures and tailored training strategies in enhancing performance.
#LLM
“Law of the Weakest Link: Cross Capabilities of Large Language Models” - The paper introduces CrossEval, a benchmark designed to evaluate LLMs on "cross capabilities," which involve multiple, intersecting skills needed for real-world tasks. Findings show that LLMs consistently underperform in these tasks due to the "Law of the Weakest Link," where the weakest individual ability limits overall cross-capability performance, highlighting a key area for improvement.
#LLM
“Not All LLM Reasoners Are Created Equal” - This study evaluates LLMs' ability to solve grade-school math (GSM) problems, focusing on pairs of problems where the second depends on the correct solution of the first. Results show a significant reasoning gap, particularly in smaller, math-specialized models, highlighting difficulties in handling additional context and second-step reasoning, even when models perform well on individual tasks.
#RAG
“Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation” - The Embodied-RAG framework enhances embodied agents by integrating a non-parametric memory system that constructs hierarchical knowledge for navigation and language generation, addressing the unique challenges of multimodal data and perception in robotics. By organizing memory as a semantic forest, Embodied-RAG efficiently generates context-sensitive outputs for over 200 queries across various environments, demonstrating its potential as a general-purpose solution for embodied agents.
#Multimodal
“LLaVA-Critic: Learning to Evaluate Multimodal Models” - LLaVA-Critic is the first open-source large multimodal model designed to evaluate performance across diverse multimodal tasks, trained on a high-quality dataset that includes various evaluation criteria. It demonstrates strong capabilities in both providing reliable evaluation scores comparable to GPT models and generating reward signals for preference learning, highlighting its potential in self-critique and alignment mechanisms for future research.
#ImageRestoration
“Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration” - This paper introduces Posterior-Mean Rectified Flow (PMRF), a novel algorithm for photo-realistic image restoration that optimally minimizes mean squared error (MSE) while ensuring perfect perceptual quality. By first predicting the posterior mean and then transporting this result to match the distribution of ground-truth images, PMRF consistently outperforms existing methods across various restoration tasks.
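Conceptually, the two stages can be sketched as below; the model names are hypothetical placeholders and the Euler integration of the flow is only illustrative:

```python
import torch

def pmrf_restore(y, posterior_mean_model, velocity_model, n_steps=100):
    # Stage 1: predict the posterior mean (MMSE estimate) of the clean image.
    x = posterior_mean_model(y)
    # Stage 2: transport it toward the ground-truth image distribution by
    # integrating a learned rectified-flow velocity field (Euler steps).
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity_model(x, t)
    return x
```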
#LLM
“RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models” - This paper presents Router by Dual Contrastive learning (RouterDC), a novel approach that enhances the assembly of multiple LLMs by effectively selecting the most suitable model for each query using a routing mechanism. Through the use of contrastive learning losses, RouterDC significantly outperforms individual LLMs and existing routing methods, achieving improvements of 2.76% on in-distribution tasks and 1.90% on out-of-distribution tasks.
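A rough sketch of what query-based routing with a contrastive objective can look like (our illustration in PyTorch, not the authors' code, and showing only a single contrastive loss rather than the paper's dual formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryRouter(nn.Module):
    def __init__(self, query_encoder, num_llms, dim):
        super().__init__()
        self.encoder = query_encoder               # e.g. a small text encoder
        self.llm_emb = nn.Parameter(torch.randn(num_llms, dim))

    def scores(self, query_batch):
        q = F.normalize(self.encoder(query_batch), dim=-1)
        e = F.normalize(self.llm_emb, dim=-1)
        return q @ e.T                             # similarity of query to each LLM

    def loss(self, query_batch, best_llm_idx, temperature=0.07):
        # Pull each query toward the LLM that answered it best.
        return F.cross_entropy(self.scores(query_batch) / temperature, best_llm_idx)

    def route(self, query_batch):
        return self.scores(query_batch).argmax(dim=-1)  # pick one LLM per query
```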
#Text2Image
“KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models” - KnobGen is a novel dual-pathway framework designed to enhance text-to-image (T2I) generation by accommodating varying levels of sketch complexity and user skill, addressing the limitations of existing models that struggle with balancing precision and control. By utilizing a Coarse-Grained Controller for high-level semantics and a Fine-Grained Controller for detailed refinement, KnobGen allows users to adjust the strength of these modules to produce images that retain a natural appearance regardless of the user's drawing proficiency. And here is the code.
#Meta #text-to-video
Movie Gen - a video generation model from Meta. You can use simple text inputs to produce custom videos and sounds, edit existing videos or transform your personal image into a unique video.
#Gemini #Google
gemini-1.5-flash-8b - Gemini 1.5 Flash-8B, a smaller and faster variant of 1.5 Flash first released as an experimental model, is now generally available for production use. Flash-8B nearly matches the performance of the 1.5 Flash model launched in May across many benchmarks. It performs especially well on tasks such as chat, transcription, and long-context language translation.
#Nvidia #Multimodal
NVLM 1.0 - a family of frontier-class multimodal LLMs that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.
#API #text-to-image
Flux 1.1 Pro and BFL API Launch - Black Forest Labs has announced FLUX1.1 [pro], an improved version of its text-to-image model that delivers faster generation and higher image quality, along with the beta BFL API, which lets developers integrate FLUX models into their own applications.
#LiquidAI
Liquid Foundation Models - new generative AI models. According to Liquid, the 1B, 3B, and 40B versions achieve state-of-the-art performance in terms of quality at each scale, while maintaining a smaller memory footprint and more efficient inference.
#RL #PyTorch
LeanRL is a lightweight library consisting of single-file, PyTorch-based implementations of popular RL algorithms. The primary goal of this library is to inform the RL PyTorch user base of optimization tricks to cut training time by half or more.
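One broadly applicable trick of this kind is compiling the hot path of the training loop. A generic illustration (not LeanRL's code), assuming the policy module returns a `torch.distributions` object over actions:

```python
import torch

@torch.compile  # fuses the loss computation into optimized kernels
def pg_loss(policy, obs, actions, advantages):
    logp = policy(obs).log_prob(actions)
    return -(logp * advantages).mean()  # vanilla policy-gradient objective

# In the loop: loss = pg_loss(policy, obs, actions, adv); loss.backward(); opt.step()
```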
#Transformer #SoundDetection
Transformer4SED - This repository aims to collect Transformer-based sound event detection (SED) algorithms.
#LLM
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
#Health #Llama
Doctor Dok is an AI-based medical data framework and patient's med vault. Parse any health-related PDF/image to JSON and then use ChatGPT / Llama to discuss it!
#CUDA #Python
bitsandbytes - a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.
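Typical usage is a drop-in swap of the optimizer; a minimal sketch:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()
# 8-bit Adam keeps optimizer state in 8 bits, cutting its memory footprint
# substantially compared to the standard 32-bit torch.optim.Adam:
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)
```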
#FileOrganizer
Local File Organizer is your personal organizing assistant, using cutting-edge AI to bring order to your file chaos - all while respecting your privacy.
#CoT #Sampling
entropix - Entropy Based Sampling and Parallel CoT Decoding.
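The core signal is the entropy of the next-token distribution. A hedged sketch of how a sampler might branch on it (the thresholds and the three-way rule below are illustrative, not entropix's actual logic):

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    # Shannon entropy of the next-token distribution; logits: (vocab_size,)
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def choose_action(logits, low=0.5, high=3.0):
    h = token_entropy(logits)
    if h < low:
        return "greedy"   # model is confident: take the argmax token
    if h > high:
        return "branch"   # model is uncertain: explore parallel CoT paths
    return "sample"       # in between: ordinary temperature sampling
```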
#Library
Concordia is a library to facilitate construction and use of generative agent-based models to simulate interactions of agents in grounded physical, social, or digital space. It makes it easy and flexible to define environments using an interaction pattern borrowed from tabletop role-playing games, in which a special agent called the Game Master (GM) is responsible for simulating the environment where player agents interact (like a narrator in an interactive story).
#LLM #Crawler
Crawl4AI - Open-source LLM Friendly Web Crawler & Scraper. Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for LLMs and AI applications.
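A quick-start sketch following the README's pattern (details may vary across versions):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Crawl a page and get back LLM-ready markdown:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```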
#Meta #Llama #API
llama-stack - This repository contains the Llama Stack API specifications as well as API Providers and Llama Stack Distributions.
#YOLO
Ultralytics YOLO11 is a cutting-edge, state-of-the-art model that builds upon the success of previous YOLO versions and introduces new features and improvements to further boost performance and flexibility. YOLO11 is designed to be fast, accurate, and easy to use, making it an excellent choice for a wide range of object detection and tracking, instance segmentation, image classification, and pose estimation tasks.
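Getting started takes a few lines; a minimal sketch using the project's example image:

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # pretrained YOLO11 nano checkpoint
results = model("https://ultralytics.com/images/bus.jpg")  # run inference
results[0].show()  # visualize the detected objects
```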
#Text-to-Speech
TTS - a deep learning toolkit for Text-to-Speech, battle-tested in research and production.
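A minimal usage sketch with one of the project's pretrained English models:

```python
from TTS.api import TTS

# Load a pretrained model by name and synthesize speech to a file:
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello from the Warsaw.AI newsletter!", file_path="output.wav")
```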
#PairProgramming #LLM
aider is AI pair programming in your terminal. Aider lets you pair program with LLMs to edit code in your local git repository. Aider works best with GPT-4o & Claude 3.5 Sonnet and can connect to almost any LLM.
#RAG
Controllable-RAG-Agent - This repository provides an advanced Retrieval-Augmented Generation (RAG) solution for complex question answering. It uses a sophisticated graph-based algorithm to handle the task.
#Multimodal
Emu3 - a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, a single transformer is trained from scratch on a mixture of multimodal sequences.
#RAG
paper-qa - high-accuracy RAG for answering questions from scientific documents with citations.
#Anaconda
Anaconda Launches AI Navigator: A Generative AI Desktop Agent
Anaconda has introduced AI Navigator, a generative AI desktop agent designed to assist data scientists and programmers in their workflows. It lets users download and run curated open-source AI models locally.
#Amazon #Health
Amazon Pharmacy Streamlines Prescription Management with Technology
Amazon Pharmacy leverages advanced technology to enhance the prescription management process, ensuring efficiency and accuracy from order placement to delivery. By integrating digital tools and automation, Amazon Pharmacy aims to provide a seamless experience for both healthcare providers and patients, reducing errors and improving accessibility to medications.
#Microsoft #DataVisualization
Data Formulator: Enhancing Data Visualization with AI
Microsoft Research introduces the Data Formulator, an AI-driven tool designed to assist analysts in creating complex data visualizations. By leveraging machine learning algorithms, the tool automates the process of selecting optimal visualization techniques, thereby streamlining data analysis and enhancing interpretability for users.
#Meta #3DReconstruction
Digital Twin Catalog: Advancements in 3D Reconstruction by Shopify and Reality Labs Research
Meta's Reality Labs Research and Shopify have collaborated to develop a digital twin catalog, leveraging advanced 3D reconstruction techniques. This initiative aims to enhance e-commerce experiences by creating highly accurate and detailed 3D models of products, facilitating better visualization and interaction for consumers.
#Google
Google Enhances Chromebook Plus with Multi-Functional Quick Insert Key and AI Features - Google has introduced a new multi-functional quick insert key and advanced AI features to the Chromebook Plus, aiming to enhance user productivity and streamline workflows. These updates are designed to provide users with more efficient ways to interact with their devices, leveraging AI to offer smarter suggestions and quicker access to frequently used functions.
#OpenAI #Coding
Canvas: A New AI-Powered Workspace for Writing and Coding - OpenAI's Canvas offers an advanced workspace for scientists and programmers, integrating AI into writing and coding tasks with real-time assistance. The platform enables users to interact with ChatGPT for debugging, inline feedback, and code suggestions, enhancing productivity while maintaining full user control.
#DeepMind #Google #RL #Chip #TPU
AlphaChip: AI-Driven Transformation in Chip Design - AlphaChip, developed by DeepMind, uses AI and reinforcement learning to design superhuman chip layouts, significantly reducing the time and effort required. It has been applied across three generations of Google's TPU chips, improving efficiency, performance, and design speed. AlphaChip's broader impact extends to industry leaders like MediaTek, showcasing its potential to revolutionize the entire chip design lifecycle.
#Meta #GenerativeAI
Meta's AI-Generated Content Raises Concerns - Meta is testing AI-generated posts tailored to users' interests and likenesses in Facebook and Instagram feeds. While this could enhance engagement, concerns arise about the impersonal nature of such content, leading to mixed reactions.
#OpenAI
OpenAI has raised $6.6 billion from investors, which could value the company at $157 billion and cement its position as one of the most valuable private companies in the world.
#VideoUnderstanding
E.T. Bench: Advancing Fine-Grained Event-Level Video Understanding - E.T. Bench introduces a comprehensive benchmark for event-level and time-sensitive video understanding, aiming to improve on the limitations of existing Video-LLMs that focus primarily on video-level comprehension. With 7.3K samples and 7K videos across 8 domains, the benchmark reveals that state-of-the-art models struggle with fine-grained tasks, prompting the development of a new baseline model, E.T. Chat, and a specialized dataset, E.T. Instruct 164K.
#Law
LexEval: A Comprehensive Benchmark for Evaluating Large Language Models in Legal Domain.