Warsaw.AI News 17-23.06.2024
Hello AI Enthusiasts!
We invite you to check out the AI news we found for you during the week of 17-23.06.2024:
“Step-by-Step Diffusion: An Elementary Tutorial” - an accessible tutorial on diffusion models and flow matching for machine learning, aimed at a technical audience with no prior diffusion experience.
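For readers new to the topic, the closed-form forward (noising) step that such tutorials typically start from can be sketched in a few lines of plain Python; the schedule values below are illustrative and not taken from the tutorial.

```python
import math
import random

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form for a DDPM-style forward
    process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    alpha_bar = 1.0
    for beta in betas[: t + 1]:
        alpha_bar *= 1.0 - beta       # a_bar_t = prod_{s<=t} (1 - beta_s)
    eps = rng.gauss(0.0, 1.0)         # standard Gaussian noise
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps, alpha_bar

rng = random.Random(0)
betas = [0.02] * 100                  # illustrative constant schedule
x0 = 1.5                              # a scalar "data point" for simplicity
x_early, ab_early = forward_diffuse(x0, 0, betas, rng)   # almost no noise yet
x_late, ab_late = forward_diffuse(x0, 99, betas, rng)    # mostly noise
```

As t grows, alpha_bar shrinks toward zero and the sample drifts from the data toward pure noise; the reverse (denoising) model learns to undo these steps.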
“Language Modeling with Editable External Knowledge” - This paper introduces ERASE, which enhances language models' adaptation to new information by incrementally deleting or rewriting entries in the knowledge base as new documents arrive, rather than focusing only on retrieval at prediction time. This approach improves accuracy in answering questions about evolving news or conversations by 7-13% for Mixtral-8x7B and 6-10% for Llama-3-8B compared to traditional retrieval-augmented generation methods.
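The incremental-editing idea can be illustrated with a toy, dictionary-backed knowledge base; this is a simplified sketch of the concept, not the paper's actual ERASE pipeline.

```python
def ingest(kb, new_facts):
    """Toy knowledge-base update: when a new document arrives, rewrite
    entries it contradicts and delete entries it retracts (signalled here
    by None). A hypothetical interface, not the paper's implementation."""
    for subject, fact in new_facts.items():
        if fact is None:
            kb.pop(subject, None)   # retraction: delete the stale entry
        else:
            kb[subject] = fact      # contradiction/update: rewrite in place

kb = {"ceo_of_acme": "Alice", "hq_of_acme": "Warsaw"}
ingest(kb, {"ceo_of_acme": "Bob"})   # news document: leadership change
ingest(kb, {"hq_of_acme": None})     # news document: HQ closed
```

The point of editing at ingestion time is that later retrieval never surfaces the stale "Alice" entry, which a purely retrieval-side approach might still return.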
“Duoduo CLIP: Efficient 3D Understanding with Multi-View Images” - Duoduo CLIP enhances 3D representation learning by using multi-view images, leveraging 2D priors from CLIP models, which improves generalization, reduces GPU requirements, and shortens training time compared to point cloud methods. It uses cross-view attention to boost performance, achieving superior text-to-shape alignment and object retrieval, while requiring significantly less computational power and fewer parameters than the current SOTA point cloud methods.
“PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers” - This study explores using LLMs for complex data-driven decision-making, defining Decision QA as finding the best decision for a question based on business rules and a database. It introduces the Decision QA benchmark with two scenarios from video games and proposes the PlanRAG technique, which outperforms the state-of-the-art iterative RAG method by 15.8% and 7.4% in the Locating and Building scenarios, respectively.
“ChangeViT: Unleashing Plain Vision Transformers for Change Detection” - Change detection in remote sensing images is crucial for monitoring environmental changes. This paper introduces ChangeViT, a framework using vision transformers (ViTs) to improve detection of large-scale changes, complemented by a detail-capture module and a feature injector for fine-grained information integration. ChangeViT outperforms existing methods on multiple high-resolution and low-resolution datasets, demonstrating the untapped potential of plain ViTs for comprehensive change detection.
“LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging” - Recent studies indicate that reducing CNN layers can boost efficiency without sacrificing performance. The authors introduce LayerMerge, a novel depth compression method that prunes both convolution and activation layers to improve inference speed while minimizing performance loss. By solving the exponential search space problem with dynamic programming, LayerMerge consistently outperforms existing methods in image classification and generation tasks.
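The idea of searching layer subsets with dynamic programming can be illustrated with a knapsack-style toy: choose which layers to keep (merging the rest) so as to maximize a per-layer importance score under a latency budget. This is only a sketch of the DP idea; LayerMerge's actual objective and formulation differ.

```python
def select_layers(importance, latency, budget):
    """Knapsack-style DP over layers: best[i][b] is the maximum total
    importance achievable using the first i layers with latency budget b.
    Illustrative of DP over an exponential subset space, not the paper's
    exact algorithm."""
    n = len(importance)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]              # option 1: merge layer i-1 away
            if latency[i - 1] <= b:                  # option 2: keep layer i-1
                cand = best[i - 1][b - latency[i - 1]] + importance[i - 1]
                best[i][b] = max(best[i][b], cand)
    return best[n][budget]

# Three layers with (importance, latency) = (3.0, 2), (1.0, 1), (4.0, 3):
score = select_layers(importance=[3.0, 1.0, 4.0], latency=[2, 1, 3], budget=4)
```

The DP runs in O(n * budget) instead of enumerating all 2^n keep/merge combinations, which is the same trick that makes searching the exponential space tractable.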
“Adversarial Attacks on Multimodal Agents” - Vision-enabled language models have enabled autonomous multimodal agents capable of actions like purchasing or code editing, but they also introduce new safety vulnerabilities. This study demonstrates that a single manipulated trigger image, perturbed within an L∞-norm of 16/256 using adversarial text strings, can steer agents toward targeted malicious goals, with success rates of up to 75% on captioner-augmented GPT-4V agents. The researchers introduce VisualWebArena-Adv, a set of adversarial tasks for evaluating agent robustness, highlighting vulnerabilities in white-box captioners and CLIP models across various VLMs and the need for more resilient multimodal agents.
“GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Model” - Recent advances in geometry estimation models show that discriminative methods relying on large-scale fine-tuning data achieve superior zero-shot generalization compared to generative-based models using small-scale synthetic data. A comprehensive evaluation reveals that fine-tuning data quality is more crucial than data scale and model architecture, challenging the necessity of complex generative models for depth estimation.
“Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks” - artificial neural networks optimize for unidirectional value propagation, unlike biological neurons, which can propagate signals in multiple directions. The proposed Hierarchical Correlation Reconstruction models joint distributions for multidirectional propagation, using a polynomial basis parameterization that allows flexible, inexpensive processing. The model can be trained via backpropagation or tensor decomposition and extends to higher-order dependencies, enabling interpretable, multidirectional propagation of values and probability densities.
“DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence” - the paper introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts code language model that matches GPT-4 Turbo in code-specific tasks. Pre-trained with an additional 6 trillion tokens, it enhances coding and mathematical reasoning capabilities, supports 338 programming languages, extends context length to 128K, and outperforms models like GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks. The code is available in a GitHub repository.
“DataComp-LM: In search of the next generation of training sets for language models” - the paper introduces DataComp for Language Models (DCLM), a testbed designed to enhance language models through controlled dataset experiments. DCLM provides a standardized corpus from Common Crawl, effective pretraining strategies via the OpenLM framework, and evaluates models across 53 downstream tasks. The baseline study within DCLM demonstrates that model-based filtering significantly improves dataset quality, enabling a 7B parameter language model to achieve competitive performance on multiple benchmarks with reduced computational resources compared to previous state-of-the-art models like MAP-Neo and Llama 3 8B. These findings underscore the critical role of dataset curation in optimizing language model training outcomes and set a foundation for further advancements in this area.
“Task Me Anything” - This paper introduces Task-Me-Anything, a versatile benchmark generation engine that tailors benchmarks to user specifications by leveraging an extensive taxonomy of visual assets. It programmatically generates a wide range of task instances and efficiently addresses user queries about the performance of machine learning models within given computational constraints. Task-Me-Anything includes a rich dataset comprising images, videos, 3D objects, object categories, attributes, and relationships, enabling the creation of 750M image/video question-answering pairs that assess the perceptual capabilities of language models.
“Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities” - This paper introduces whiteboard-of-thought prompting, a method that enhances multimodal LLMs by allowing them to utilize visual reasoning capabilities through metaphorical 'whiteboards' for drawing and processing reasoning steps as images. By leveraging existing model capabilities in libraries like Matplotlib and Turtle, this approach achieves state-of-the-art performance on challenging natural language tasks involving visual and spatial reasoning. Compared to conventional chain-of-thought methods, whiteboard-of-thought enables significantly higher accuracy, reaching up to 92% accuracy in settings where the former method fails entirely, as demonstrated across multiple evaluations.
“OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations” - This paper introduces OGNI-DC, a novel depth completion framework that utilizes "Optimization-Guided Neural Iterations" (OGNI) to refine depth gradients and integrate them into dense depth maps. OGNI-DC demonstrates robust generalization capabilities, surpassing baseline methods by a significant margin on diverse datasets and different levels of sparsity. It achieves state-of-the-art performance on benchmark datasets such as NYUv2 and KITTI, highlighting its high accuracy and effectiveness in depth completion tasks.
“TroL: Traversal of Layers for Large Language and Vision Models” - This paper introduces TroL, a novel efficient LLVM family with model sizes of 1.8B, 3.8B, and 7B parameters, which employs a layer traversing technique to reuse layers in a token-wise manner. TroL demonstrates superior performance over existing open-source LLVMs with larger model sizes and competes closely with closed-source LLVMs like GPT-4V, known for their powerful vision language capabilities, while mitigating the high computational costs associated with training and inference.
NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models - Nemotron-4 340B is a family of models optimized for NVIDIA NeMo and NVIDIA TensorRT-LLM; it includes cutting-edge instruct and reward models and a dataset for generative AI training.
Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation.
Anthropic introduced Claude 3.5 Sonnet outperforming competitor models and Claude 3 Opus on a wide range of evaluations.
Meta released New AI Research Models to Accelerate Innovation at Scale - Chameleon can process and generate both text and images. JASCO is capable of accepting various inputs, such as chords or beats, to improve control over generated music outputs. AudioSeal is an audio watermarking technique designed specifically for the localized detection of AI-generated speech.
mamba2-hybrid-8b-3t-4k - an 8B-parameter model released by NVIDIA, made of Mamba-2, attention, and MLP layers, trained for the paper “An Empirical Study of Mamba-based Language Models”.
Granite Code Models - a family of open foundation models for code generative tasks (e.g., fixing bugs, explaining code, documenting code), trained with code written in 116 programming languages.
Warp - a Python framework for writing high-performance simulation and graphics code. Warp takes regular Python functions and JIT compiles them to efficient kernel code that can run on the CPU or GPU.
mlx-gpt2 - ~200 lines of Python code that define and train GPT-2 from scratch using mlx and numpy as the only dependencies.
Mistral Cookbook features examples contributed by the Mistral AI community and partners.
Cognita - an open-source RAG framework by TrueFoundry for building modular, production-ready applications.
Vanna - an MIT-licensed open-source Python RAG framework for SQL generation and related functionality.
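The retrieval step behind a text-to-SQL RAG flow like Vanna's can be sketched as below; the function names and the word-overlap "similarity" are illustrative stand-ins (Vanna's real API typically uses embeddings and a vector store).

```python
# Toy retrieval step for a text-to-SQL RAG flow: pick the schema snippet
# most relevant to the question, then build an LLM prompt around it.
SCHEMA_DOCS = [
    "CREATE TABLE orders (id INT, customer_id INT, total DECIMAL)",
    "CREATE TABLE customers (id INT, name TEXT, country TEXT)",
]

def retrieve(question, docs):
    """Rank docs by word overlap with the question — a crude stand-in
    for the embedding similarity a real framework would use."""
    q = set(question.lower().split())
    return max(
        docs,
        key=lambda d: len(q & set(d.lower().replace("(", " ").replace(",", " ").split())),
    )

def build_prompt(question, docs):
    """Assemble the retrieved schema and the question into an LLM prompt."""
    context = retrieve(question, docs)
    return f"Schema:\n{context}\n\nWrite SQL for: {question}"

prompt = build_prompt("total orders per customer_id", SCHEMA_DOCS)
```

In a real deployment the prompt would then go to an LLM, and the generated SQL could be executed and validated against the database.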
LARS - an application that enables you to run LLMs locally on your device, upload your own documents and engage in conversations wherein the LLM grounds its responses with your uploaded content.
Argilla is a collaboration platform for AI engineers and domain experts who require high-quality outputs, full data ownership, and overall efficiency.
Introducing Gen-3 Alpha: A New Frontier for Video Generation - Gen-3 Alpha is RunwayML's first model trained on a new infrastructure for large-scale multimodal training, significantly improving fidelity, consistency, and motion. The model supports text-to-video and image-to-video generation, with advanced control features. Gen-3 Alpha introduces new safeguards, including improved visual moderation and C2PA provenance standards.
Core ML: Introduction to Apple's Machine Learning Framework - Core ML is Apple's framework for integrating machine learning models into iOS and macOS applications, supporting various model types and offering performance optimization tools. It allows developers to easily deploy AI features directly on Apple devices, supporting computer vision, natural language processing, and more.
DeepMind's New AI Generates Soundtracks and Dialog for Videos - DeepMind has developed new AI technology that automatically generates soundtracks and dialogue for videos by analyzing visual content and creating realistic sound effects. This technology can significantly enhance video production quality, offering more immersive audiovisual experiences.
Mindmap Generator - an online tool that automatically generates interactive mind maps from input text, helping users organize and visualize information. It allows the creation of structured, easily editable mind maps, supporting learning, planning, and brainstorming.
Video to Sound Effects: Tool for Converting Videos to Sound Effects - Video to Sound Effects is a platform that allows users to generate sound effects from videos using advanced content analysis and sound synthesis algorithms. It enables automatic addition of sounds to video clips, enhancing their quality and immersion.
Logit Prisms: Insight into the Inner Workings of Language Models - a visualization tool for analyzing the internal mechanisms of language models, allowing the examination of logit distribution changes during text generation and understanding the model's decision-making process. The tool helps users better interpret AI models by visualizing the response generation process.
Introducing AutoGen Studio: A Low-Code Interface for Building Multi-Agent Workflows - Microsoft Research introduces AutoGen Studio, a low-code interface for rapidly creating, testing, and sharing multi-agent solutions. Built on the AutoGen framework, it allows users to build and customize agents and workflows with minimal coding. AutoGen Studio supports creating agents for various tasks, such as travel planning and financial management, through an interactive interface and easy debugging.
Dot - a standalone, open-source application designed for seamless interaction with documents and files using local LLMs and RAG. It is inspired by solutions like Nvidia's Chat with RTX, providing a user-friendly interface for those without a programming background. Using the Phi-3 LLM by default, Dot ensures accessibility and simplicity right out of the box.
https://github.com/jupyter-naas/awesome-notebooks - The objective of this repository is to create the largest catalog of production-ready Jupyter Notebook templates. With these templates, it becomes easy to create data products (analytical dashboards, automation/AI engines, and more).
WebCanvas - a pioneering online evaluation framework designed to address the dynamic nature of web interactions. It provides a realistic assessment of autonomous web agents by utilizing live web environments and emphasizing task completion through the identification of key nodes.
CIFAR-10 Airbench - this project contains state-of-the-art fast training methods for CIFAR-10.
TokenCost: Tool for Estimating Token Costs for Over 400 LLMs - a CLI tool for quickly and easily calculating the cost of using major Large Language Model (LLM) APIs, such as GPT-3.5 and GPT-4. It allows precise token counting for prompts and responses, integrates with OpenAI and other models, provides updated prices, and supports various data formats.
Nvidia Surpasses Microsoft in Market Cap, Becomes Most Valuable Public Company - Nvidia has become the most valuable public company, surpassing Microsoft in market capitalization, driven by surging demand for its graphics processors, which are crucial for the advancement of artificial intelligence and machine learning.
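A TokenCost-style cost estimate boils down to a per-model price lookup applied to token counts. The sketch below is hypothetical, not TokenCost's API, and the prices are placeholders — real rates change frequently.

```python
# Hypothetical per-1K-token prices in USD. Placeholder values only:
# always check the provider's current pricing page before relying on them.
PRICES = {
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
}

def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate the USD cost of one API call from its token counts."""
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

cost = estimate_cost("gpt-4", prompt_tokens=1200, completion_tokens=300)
```

Real tools add the harder part this sketch omits: counting tokens accurately with the model's own tokenizer rather than taking the counts as given.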
Former OpenAI Chief Scientist Ilya Sutskever Introduces Safe Superintelligence Inc. - Ilya Sutskever, former chief scientist at OpenAI, announced Safe Superintelligence Inc. (SSI), a new company aimed at developing superintelligence safely and mitigating the potential risks associated with advanced AI.
HelpSteer2: Open-source Dataset for Training Top-Performing Reward Models - HelpSteer2 is a CC-BY-4.0-licensed preference dataset for training reward models, containing 10,000 response pairs. Reward models trained on HelpSteer2 achieve a state-of-the-art score (92.0%) on Reward-Bench's primary dataset, outperforming existing open and proprietary models and effectively aligning LLMs.
BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks - BigCodeBench is a new benchmark evaluating large language models' (LLM) performance in solving complex programming tasks. It includes 1,140 tasks requiring diverse library and function usage and features two evaluation variants: function completion and instruction interpretation.
Nvidia Conquers Latest AI Tests: GPU Maker Tops New MLPerf Benchmarks - Nvidia dominates the new MLPerf tests, setting records in benchmarks for fine-tuning large language models and graph neural networks with a system comprising 11,616 H100 GPUs, demonstrating exceptional scalability and software optimization.