Warsaw.AI News 8-14.04.2024
Hello AI Enthusiasts!
Here is the AI news we found for you for the week of 8-14.04.2024:
Polish AI Olympiad for high school students - Stage I will start on April 22.
We encourage you to participate in AI INNOVATORS 2024 - a nationwide artificial intelligence hackathon featuring expert panels and networking among experts and AI enthusiasts. Organized by ChallengeRocket in content partnership with the Warsaw Stock Exchange. Date and place: April 23, 2:00 p.m., WSE Main Hall, ul. Książęca 4, Warsaw.
“SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing” - A framework to facilitate effective editing of personal content by allowing individuals to swap any objects in an image with personalized concepts from a reference while maintaining the original context. Compared to existing methods, SwapAnything offers precise control over arbitrary objects, preserves context pixels more faithfully, and better adapts personalized concepts to the image, demonstrated through extensive human and automatic evaluations across various swapping tasks.
“UniFL: Improve Stable Diffusion via Unified Feedback Learning” - A unified framework designed to address limitations in current diffusion models by employing feedback learning to enhance visual quality, aesthetic appeal, and inference speed. Through its components, including perceptual feedback learning, decoupled feedback learning, and adversarial feedback learning, UniFL demonstrates superior performance in both model quality enhancement and acceleration across various diffusion models, validated through extensive experiments and user studies.
“Learning agile soccer skills for a bipedal robot with deep reinforcement learning” - The authors trained a humanoid robot to play a simplified one-versus-one soccer game, resulting in an agent with robust and dynamic movement skills and adaptive tactical behavior. The agent exhibited improved performance in walking, turning, ball kicking, and recovery compared to a scripted baseline, demonstrating the efficacy of deep RL for synthesizing sophisticated and safe movement strategies for humanoid robots.
“OpenEQA: From word models to world models” - The Open-Vocabulary Embodied Question Answering (OpenEQA) benchmark assesses AI agents' comprehension of physical spaces through questions like "Where did I leave my badge?". The release of OpenEQA aims to inspire open research aimed at enhancing AI agents' ability to understand and communicate about the world they observe.
“DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation” - "DreamView" addresses the challenge of customizing specific appearances in 3D generation while adhering to an overall text description. By incorporating view-specific and overall text guidance through a collaborative text guidance injection module, it enables multi-view customization while maintaining consistency, empowering artists to design innovative and diverse 3D objects.
“Policy-Guided Diffusion” - Policy-guided diffusion offers a solution to the distribution shift problem in offline reinforcement learning by generating synthetic trajectories under the behavior distribution while using guidance from the target policy. This approach balances action likelihood under both policies, leading to plausible trajectories with high target policy probability and lower dynamics error compared to offline world model baselines, thereby improving performance across various reinforcement learning algorithms and environments.
“Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks” - Ada-LEval addresses the need for precise evaluation of LLMs on extremely long documents by introducing two challenging subsets, TSort and BestAnswer, enabling reliable assessment of LLMs' long-context understanding. Unlike existing benchmarks, Ada-LEval supports fine-grained manipulation of test-case lengths, covering ultra-long settings of up to 128k tokens, and its evaluation results reveal the limitations of current LLMs, particularly in ultra-long-context scenarios.
“Hash3D: Training-free Acceleration for 3D Generation” - Hash3D introduces an innovative approach to accelerate 3D generative modeling by exploiting feature-map redundancy across neighboring timesteps and camera angles, achieved through adaptive grid-based hashing. This acceleration technique significantly improves efficiency, speeding up optimization by up to 4 times across various text-to-3D and image-to-3D models while also enhancing the smoothness and view consistency of synthesized 3D objects.
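The core trick in Hash3D - reusing feature maps whose (timestep, camera angle) fall into the same grid cell - is essentially memoization keyed on a quantized coordinate. A minimal sketch of that caching pattern, with made-up bin sizes and a string standing in for the expensive feature computation:

```python
def grid_key(timestep, azimuth_deg, t_bin=50, ang_bin=15.0):
    """Quantize (timestep, camera azimuth) into a grid cell used as the cache key."""
    return (timestep // t_bin, int(azimuth_deg // ang_bin) % int(360 / ang_bin))

cache = {}
calls = 0

def features(timestep, azimuth_deg):
    """Reuse a cached feature map when a nearby (timestep, view) was already computed."""
    global calls
    key = grid_key(timestep, azimuth_deg)
    if key not in cache:
        calls += 1                      # stand-in for an expensive denoiser pass
        cache[key] = f"feat@{key}"
    return cache[key]

for t in (990, 970, 940):               # three nearby timesteps...
    for az in (10.0, 12.0, 40.0):       # ...and three camera angles
        features(t, az)
print(calls)  # 4: nine queries collapse into 2 timestep bins x 2 angle bins
```

The real method hashes continuous feature maps adaptively rather than strings, but the speedup comes from the same cell-level reuse.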
“MoCha-Stereo: Motif Channel Attention Network for Stereo Matching” - MoCha-Stereo introduces a novel Motif Channel Attention approach to stereo matching, addressing edge detail mismatches by leveraging the Motif Channel Correlation Volume to enhance accuracy in matching costs. By incorporating the Reconstruction Error Motif Penalty module, which refines disparity estimation based on frequency information from reconstruction error features, MoCha-Stereo achieves top performance on the KITTI-2015 and KITTI-2012 leaderboards while demonstrating excellence in Multi-View Stereo tasks.
“Efficient and Generic Point Model for Lossless Point Cloud Attribute Compression” - PoLoPCAC presents a novel approach to lossless point cloud attribute compression (PCAC), achieving high compression efficiency and strong generalizability by inferring explicit attribute distributions from group-wise autoregressive priors. Through a progressive random grouping strategy and a locality-aware attention mechanism, PoLoPCAC efficiently resolves point clouds into groups and models their attributes sequentially, demonstrating continuous bit-rate reduction over existing methods while boasting shorter coding times and a lightweight model size suitable for practical applications.
“SplatPose & Detect: Pose-Agnostic 3D Anomaly Detection” - SplatPose is introduced as a solution to detect anomalies in 3D objects captured from varying viewpoints, overcoming the limitations of existing methods by employing a novel 3D Gaussian splatting-based framework. By accurately estimating pose and detecting anomalies in unseen views, SplatPose achieves state-of-the-art results in training speed, inference speed, and detection performance, even with limited training data, as demonstrated on the Pose-agnostic Anomaly Detection benchmark and its multi-pose anomaly detection (MAD) dataset.
“SambaLingo: Teaching Large Language Models New Languages” - The authors delved into the adaptation of large language models to new languages, exploring components such as vocabulary extension, direct preference optimization, and strategies for addressing data scarcity. In scaling experiments across nine languages and two parameter scales, their models outperformed previous baselines including Llama 2, Aya-101, XGLM, and BLOOM, demonstrating significant advances in LLM adaptation to diverse linguistic contexts.
“Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs” - The authors introduced Ferret-UI, a multimodal large language model (MLLM) specifically designed to comprehend and interact with mobile user interface (UI) screens, featuring enhanced understanding through referring, grounding, and reasoning capabilities. By incorporating "any resolution" to magnify details and leveraging enhanced visual features, Ferret-UI demonstrates outstanding performance in tasks such as icon recognition, text finding, and widget listing, surpassing both open-source UI MLLMs and GPT-4V on various elementary UI tasks.
“SOAR: New algorithms for even faster vector search with ScaNN” - SOAR is an algorithmic improvement to vector search that introduces effective, low-overhead redundancy to ScaNN, Google's vector search library, making ScaNN even more efficient.
“Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention” - This work presents a method for scaling Transformer-based LLMs to handle infinitely long inputs while maintaining bounded memory and computation. The approach introduces Infini-attention, a new attention mechanism that incorporates compressive memory, masked local attention, and long-term linear attention within a single Transformer block, demonstrating its effectiveness in long-context language modeling tasks and enabling fast streaming inference for LLMs.
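The compressive memory behind Infini-attention stores past segments in a fixed-size associative matrix rather than a growing KV cache. A minimal NumPy sketch of that idea, using a standard linear-attention update (accumulate key-value bindings, retrieve with a positive feature map); the ELU+1 feature map and all dimensions are illustrative simplifications, not the paper's exact formulation:

```python
import numpy as np

def elu1(x):
    # ELU(x) + 1: a common positive feature map for linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

class CompressiveMemory:
    """Constant-size associative memory: segments are absorbed one at a time,
    so storage stays O(d_k * d_v) no matter how long the input stream gets."""
    def __init__(self, d_k, d_v):
        self.M = np.zeros((d_k, d_v))   # accumulated key-value bindings
        self.z = np.zeros(d_k)          # normalization term

    def update(self, K, V):
        sK = elu1(K)
        self.M += sK.T @ V              # bind this segment's keys to its values
        self.z += sK.sum(axis=0)

    def retrieve(self, Q):
        sQ = elu1(Q)
        return (sQ @ self.M) / (sQ @ self.z + 1e-8)[:, None]

rng = np.random.default_rng(1)
mem = CompressiveMemory(d_k=8, d_v=8)
for _ in range(5):                      # stream five segments of 32 tokens each
    seg = rng.normal(size=(32, 8))
    mem.update(seg, seg)
out = mem.retrieve(rng.normal(size=(4, 8)))
print(out.shape)  # (4, 8)
```

The full mechanism combines this long-term memory with masked local attention inside each block; the sketch shows only why memory stays bounded as segments accumulate.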
“Measuring the Persuasiveness of Language Models” - A study comparing persuasiveness found that the Claude 3 Opus AI model closely matched human persuasiveness, with statistical tests showing no significant difference, though humans were slightly more persuasive. The results underscore the growing convincing ability of advanced AI models like Claude 3 Opus, confirmed by a control condition demonstrating negligible persuasiveness for indisputable facts.
InstantMesh: Efficient 3D Mesh Generation - InstantMesh by TencentARC efficiently generates 3D meshes from single images, enhancing computer vision and graphic applications with sparse-view large reconstruction models.
Simple-Evals: OpenAI's Lightweight Language Model Evaluation Library - OpenAI's GitHub repository "simple-evals" offers a lightweight library designed to transparently evaluate the accuracy of language models, with a focus on zero-shot, chain-of-thought assessments. It supports various benchmarks to test models like GPT-4, though the repository is not actively maintained.
SqueezeAttention: Optimizing LLM KV-Cache Allocation - SqueezeAttention is a GitHub project aimed at optimizing the Key-Value cache of Large Language Models by managing cache allocation across attention layers, resulting in significant memory reductions and throughput improvements.
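To see why the KV cache is worth optimizing, the standard back-of-the-envelope formula helps: 2 tensors (keys and values) per layer, per head, per position. The function below is that generic calculation; the 7B-class configuration plugged in is illustrative, not taken from the SqueezeAttention repo:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Bytes for keys + values: 2 tensors * layers * heads * head_dim * seq * batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 7B-class config (32 layers, 32 KV heads, head_dim 128), fp16, batch 1:
gb = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
print(round(gb, 2))  # 2.0
```

At 4k context this already costs ~2 GiB per sequence, which is why reallocating cache budget across layers, as SqueezeAttention does, translates directly into memory and throughput gains.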
Anthropic Cookbook: Enhancing AI Development with Claude - The Anthropic Cookbook is a practical resource on GitHub designed to help developers integrate and utilize the Claude AI model effectively. It includes a variety of code snippets and guides, facilitating the application of Claude in diverse projects.
Open-Sora-Plan: Advancing Text-to-Video AI Models - Open-Sora-Plan is an open-source initiative aiming to replicate and enhance OpenAI's Sora model for text-to-video applications, focusing on improving video generation quality and enabling higher resolution and longer duration outputs.
IPEX-LLM: Accelerating LLM Performance on Intel Hardware - IPEX-LLM is a PyTorch library designed to optimize local LLM inference and fine-tuning on Intel CPUs and GPUs, enhancing performance for various large language models through seamless integration with tools like HuggingFace and DeepSpeed.
Chronon: Streamlining AI/ML Data Handling - Chronon, developed by Airbnb, is a data platform designed to simplify complex data operations for AI/ML applications, enabling efficient management of batch and real-time data processing with features like scalable backfills and low-latency serving.
VAR: Redefining Image Generation with Autoregressive Models - The VAR project on GitHub introduces a novel approach to autoregressive visual generation, significantly advancing beyond traditional diffusion models and exploring scaling laws in image generation, aimed at enhancing GPT-style model capabilities in visual tasks.
ChemBench: Enhancing Chemistry Benchmarks for LLMs - ChemBench is a GitHub project that aims to expand the scope of chemistry benchmarks for Large Language Models (LLMs), focusing on creating a comprehensive testing suite that is compatible with existing benchmarks like BIG-bench.
llm.c: Simplifying LLM Training with Pure C/CUDA - llm.c is a GitHub project that simplifies LLM training using raw C/CUDA, eliminating the need for heavy dependencies like PyTorch. This approach allows for efficient training of models like GPT-2 directly in C, with an emphasis on minimalism and performance.
AutoCodeRover: Streamlining Code Improvements - AutoCodeRover on GitHub autonomously enhances software development by automating task resolution like bug fixes and feature implementations using AI, with a success rate of 15.95% on the SWE-bench tasks.
Conceptual Depth: Probing LLMs for Layer-Specific Insights - The GitHub project "Conceptual Depth" utilizes probing techniques to study and evaluate the concept learning processes within different layers of Large Language Models, offering insights to improve model training effectiveness.
JetStream: Optimizing LLM Inference on XLA Devices - Google's JetStream enhances LLM inference efficiency on XLA devices like TPUs, aiming for future GPU support to maximize throughput and memory usage in AI applications.
Attorch: Enhancing PyTorch with OpenAI's Triton - Attorch redefines PyTorch's neural network modules using OpenAI's Triton for greater computational efficiency, designed for both research and development in machine learning.
Flyflow: Enhancing LLM Middleware Performance - Flyflow is an API middleware designed to optimize LLM applications by improving response quality and reducing latency, providing a drop-in replacement for existing APIs like OpenAI's, with enhanced security features and higher token limits.
Gemma Family Expands with Models Tailored for Developers and Researchers - Google announced its first round of additions to the Gemma family, expanding the possibilities for ML developers to innovate responsibly: CodeGemma for code completion and generation tasks as well as instruction following, and RecurrentGemma, an efficiency-optimized architecture for research experimentation. Plus, they've shared some updates to Gemma and terms aimed at improvements based on feedback from the community and our partners.
Rerank 3 - Cohere's new foundation model for efficient enterprise search and RAG systems.
JetMoE - Trained on public datasets with a modest amount of compute, this mixture-of-experts model achieves performance comparable to Meta's larger and more costly Llama 2 7B model.
Grok-1.5V - xAI's multimodal model with enhanced visual processing capabilities, empowering it to analyze diverse visual data types alongside text. Early testers and current Grok users can anticipate access to Grok-1.5V in the near future, broadening its utility across various domains.
Stable LM 2 12B is a pair of powerful 12 billion parameter language models trained on multilingual data in English, Spanish, German, Italian, French, Portuguese, and Dutch.
Mixtral-8x22B - a pretrained generative Sparse Mixture-of-Experts model with ~141B total parameters (~39B active during inference), a 64K-token context window, and 8 experts with 2 routed per token.
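The "8 experts, 2 per token" design is what keeps active parameters far below the total: each token's router picks its top-2 experts and mixes their outputs by softmax weight. A minimal NumPy sketch of that routing pattern (toy dimensions and linear-map "experts" are illustrative, not Mixtral's actual layers):

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Route each token to its top-k experts; mix outputs by softmax gate weights.

    x: (tokens, d_model); gate_w: (d_model, n_experts);
    experts: list of callables, one per expert (here simple linear maps).
    """
    logits = x @ gate_w                             # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                # softmax over selected experts only
        for weight, e in zip(w, top[t]):
            out[t] += weight * experts[e](x[t])
    return out

rng = np.random.default_rng(0)
d, n_exp = 16, 8
experts = [lambda v, W=rng.normal(size=(d, d)) / d: v @ W for _ in range(n_exp)]
gate_w = rng.normal(size=(d, n_exp))
x = rng.normal(size=(4, d))
y = moe_layer(x, gate_w, experts, top_k=2)
print(y.shape)  # (4, 16)
```

Per token only 2 of the 8 expert matrices are touched, which is why inference cost tracks the active (~39B) rather than the total (~141B) parameter count.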
RMBG v1.4 is a background removal model by BRIA AI, designed to effectively separate foreground from background in a range of categories and image types. This model has been trained on a carefully selected dataset, which includes general stock images, e-commerce, gaming, and advertising content, making it suitable for commercial use cases powering enterprise content creation at scale. The accuracy, efficiency, and versatility currently rival leading source-available models. It is ideal where content safety, legally licensed datasets, and bias mitigation are paramount.
Parler-TTS Mini v0.1 is a lightweight text-to-speech (TTS) model, trained on 10.5K hours of audio data, that can generate high-quality, natural sounding speech with features that can be controlled using a simple text prompt.
Integrating Custom Tools with Claude for Enhanced Functionality - The guide details the process for developers to enhance Anthropic's Claude AI with custom tools via the API, illustrating how these tools can interact with Claude to extend its capabilities for specific tasks in diverse domains.
Advancing AI Capabilities at Google Cloud Next '24: Introducing Gemini 1.5 Pro and Imagen 2.0 - At Google Cloud Next '24, Google announced significant updates to Vertex AI, including the public preview of Gemini 1.5 Pro with the world's largest context window, and new functionalities in Imagen 2.0 such as live image generation from text and advanced image editing tools. These updates aim to enhance the development, deployment, and maintenance of generative AI applications.
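The tool-use flow from the Anthropic guide above boils down to three pieces on the client side: a JSON-schema tool definition, the tool_use block the model emits, and the tool_result you send back. A sketch of that round trip without calling the API; the get_weather tool and its canned data are invented for illustration, while the block shapes follow Anthropic's documented format:

```python
# Tool definition in the shape Anthropic's API expects
# (name, description, input_schema); the weather tool itself is made up.
TOOLS = [{
    "name": "get_weather",
    "description": "Get the current temperature for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# Local implementations keyed by tool name; canned data for the sketch.
IMPLS = {"get_weather": lambda city: {"city": city, "temp_c": 11}}

def handle_tool_use(block):
    """Given a tool_use content block from the model, run the matching tool
    and build the tool_result content to send back in the next request."""
    result = IMPLS[block["name"]](**block["input"])
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": str(result),
    }

# A tool_use block shaped like what the API returns when the model calls a tool:
fake_block = {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
              "input": {"city": "Warsaw"}}
reply = handle_tool_use(fake_block)
print(reply["tool_use_id"])  # toolu_01
```

In a real integration the tool_result goes back in a user-role message, and the model then composes its final answer from the returned data.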
Enhancing Software Development with Gemini Code Assist - Gemini Code Assist, newly introduced by Google, integrates AI into IDEs like VS Code to boost development productivity through features like AI-driven code completion, natural language interactions, and customizable code intelligence using private codebases.
Meta Accelerates AI Development with Enhanced MTIA Chips - Meta's updated MTIA chips significantly improve the efficiency of AI model training, aiming to expand their use beyond ranking and recommendation algorithms to include generative AI capabilities.
Rethinking Product Design with Large Language Models - This article provides a mental model for integrating Large Language Models into products, emphasizing the importance of adjusting expectations and workflows to leverage LLMs effectively without overestimating their precision.
Gemini for Google Cloud: Unveiling Enhanced AI Capabilities - This article introduces Gemini for Google Cloud, detailing its advanced features and integration across Google's services. It highlights Gemini's state-of-the-art AI capabilities, particularly in handling extensive datasets with its long-context window, and its versatility across different platforms including mobile and cloud-based applications.
Intel unveiled the Gaudi 3 AI accelerator chip, positioning it as a competitive alternative to Nvidia's H100 for running large language models like ChatGPT. The company claims that Gaudi 3 offers 50% faster training and inference than Nvidia's H100, particularly when running models such as OpenAI's GPT-3 175B and Meta's Llama 2.
The Art of Product Management in AI-Driven Environments - This article discusses strategies for product management within the unpredictable realm of AI, emphasizing adaptive design and continuous evaluation to handle non-deterministic outputs.
Google Cloud Next 2024: Key Announcements and Innovations - The article summarizes the highlights from the Google Cloud Next 2024 event, detailing new AI-driven features and tools for developers, enhanced security solutions, and Google's advancements in cloud and AI technology.
Enjoy!
Warsaw.AI News Team
P.S.: This newsletter is free, but if you enjoy it, you are welcome to donate to help us make it even better (readers paying taxes in Poland can also contribute 1.5% of their tax to support us).