Warsaw.AI News 3-9.06
Hello AI Enthusiasts!
Please check out the AI news we found for you in the week of 3-9.06.2024:
An Interview with the Most Prolific Jailbreaker of ChatGPT and Other Leading LLMs - Pliny the Prompter, known for breaking the security of leading language models (LLMs) like ChatGPT, discusses his methods, motivations, and the impact of his work on the AI industry.
AI & Education (12.06.2024, 18:00 UTC+2, WWSI) - a meetup organized jointly by Warsaw.AI and the ML in PL Association. With speakers from Warsaw universities and the organizers of the Polish Olympiad in AI, we will hear potentially opposing views on the future role of AI in education and discuss how we can teach future generations to use AI in all professions, not only engineering.
Course: AI Agents in LangGraph - This course teaches how to build AI agents using LangGraph, covering LangChain framework components, techniques like agentic search, state management, and human-in-the-loop integration.
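The course builds agents with LangGraph's graph-based API; as a rough plain-Python schematic (not LangGraph's actual API, and with a stubbed planner in place of an LLM call), an agent loop with explicit state and a human-in-the-loop approval gate looks something like this:

```python
# Schematic agent loop: explicit state dict, a tool step, and a
# human-in-the-loop approval gate. Plain Python, names illustrative.

def plan(state):
    # A real agent would call an LLM here; we stub the decision.
    state["action"] = "search" if not state["results"] else "finish"
    return state

def act(state):
    # Stubbed "agentic search" tool call.
    state["results"].append(f"result for: {state['query']}")
    return state

def run_agent(query, approve=lambda state: True):
    state = {"query": query, "results": [], "action": None}
    while True:
        state = plan(state)
        if state["action"] == "finish":
            return state
        if not approve(state):          # human-in-the-loop checkpoint
            return state
        state = act(state)

final = run_agent("LangGraph docs")
print(final["results"])  # → ['result for: LangGraph docs']
```

In LangGraph itself, the nodes and the loop above would instead be declared as a `StateGraph` with edges between nodes, and the checkpoint handled by the framework's interrupt mechanism.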
Anthropic's comprehensive tool use tutorial - Across six lessons, you will learn how to implement tool use successfully in your workflows with Claude.
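The tool-use pattern such tutorials teach can be sketched as a dispatch loop: the model emits a tool request, the client executes a local function, and the result is fed back for the final answer. The "model" below is a mock and all names are illustrative, not Anthropic's API:

```python
import json

# Minimal sketch of a tool-use loop with a mocked model turn.
# Tool names, schema, and messages are illustrative only.

TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def mock_model(messages):
    # Pretend the model requests a tool on the first turn,
    # then answers once it has seen the tool result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "input": {"city": "Warsaw"}}
    return {"text": "It is 21 °C in Warsaw."}

def chat(user_msg):
    messages = [{"role": "user", "content": user_msg}]
    while True:
        reply = mock_model(messages)
        if "tool" in reply:
            result = TOOLS[reply["tool"]](**reply["input"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply["text"]

print(chat("What's the weather in Warsaw?"))
```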
“MegActor: Harness the Power of Raw Video for Vivid Portrait Animation” - Raw driving videos contain richer facial expression information than intermediate representations like landmarks, but research is limited due to challenges of identity leakage and performance degradation from irrelevant details. To address these, the authors propose MegActor, a conditional diffusion model that uses synthetic data to mitigate ID leakage, segments foreground and background with CLIP encoding for stability, and applies style transfer to eliminate facial detail influence, achieving results comparable to commercial models with training on public datasets.
“MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark” - Benchmarks like the Massive Multitask Language Understanding (MMLU) have advanced AI comprehension and reasoning but are now showing performance plateaus among improving models. To address this, MMLU-Pro introduces more challenging, reasoning-focused questions and expands answer choices, resulting in significantly lower accuracy and greater stability under varied prompts, with Chain of Thought reasoning showing improved performance, highlighting its ability to better discriminate model capabilities.
“Improved Techniques for Optimization-Based Jailbreaking on Large Language Models” - LLMs require robust safety alignment for widespread deployment, with the Greedy Coordinate Gradient (GCG) attack drawing significant attention for optimization-based jailbreaking techniques despite its inefficiencies. This paper introduces improved methods, including diverse harmful target templates and an automatic multi-coordinate updating strategy, to enhance GCG's performance, resulting in the new I-GCG method which achieves nearly 100% attack success rate in benchmarks like the NeurIPS 2023 Red Teaming Track.
“ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation” - Video generation has advanced significantly with the advent of video diffusion models, such as Stable Video Diffusion, though most models generate low frame rate videos due to GPU memory limitations and the challenge of modeling many frames. This paper introduces a training-free video interpolation method for generative video diffusion models, which uses a self-cascaded architecture and hidden state correction modules to maintain temporal consistency, demonstrating effectiveness comparable to trained interpolation models across various video models.
“SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales” - LLMs often produce inaccurate information and lack reliable confidence indicators, limiting their applications. SaySelf, a new training framework, enhances LLMs' fine-grained confidence estimates and self-reflective rationales by summarizing uncertainties from multiple reasoning chains, using supervised fine-tuning and reinforcement learning to calibrate confidence, resulting in reduced calibration errors and maintained task performance.
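For reference, calibration error in this setting is commonly measured as Expected Calibration Error (ECE); a minimal sketch of that metric (an illustration, not SaySelf's own evaluation code):

```python
# Expected Calibration Error (ECE): bin predictions by stated
# confidence, then average |accuracy - confidence| weighted by bin size.

def ece(confidences, correct, n_bins=10):
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)
    return total

# Perfectly calibrated: 80% stated confidence, 80% actually correct.
print(ece([0.8] * 5, [1, 1, 1, 1, 0]))
```

A well-calibrated model's stated confidence matches its empirical accuracy in every bin, driving ECE toward zero.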
“LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing” - LLMs excel in knowledge memorization but struggle with domain-specific tasks like medical applications and classification questions. To address this, the authors propose LlamaCare, a fine-tuned medical language model, together with an Extended Classification Integration module that improves classification performance and reduces redundant categorical answers. LlamaCare reaches ChatGPT-level performance with low carbon emissions while training on a single 24 GB GPU, and the authors also provide processed data for one-shot and few-shot training on benchmarks like PubMedQA and USMLE.
“Vision-LSTM: xLSTM as Generic Vision Backbone” - Transformers, originally for natural language processing, are now widely used in computer vision. This paper introduces Vision-LSTM, an adaptation of the scalable xLSTM architecture for computer vision, which processes patch tokens in alternating directions through a stack of xLSTM blocks, showing potential as a new generic backbone for computer vision architectures.
“Extracting Concepts from GPT-4” - Neural activity inside language models is still poorly understood. OpenAI’s improved methods for identifying "features" (patterns of activity) scale better than existing techniques and have identified 16 million features in GPT-4, aiming for human interpretability.
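Feature extraction of this kind typically relies on sparse autoencoders: activations are projected onto a learned dictionary and only the few strongest feature activations are kept. A toy sketch of such a top-k encode step, with a made-up dictionary (not OpenAI's implementation):

```python
# Toy top-k sparse-autoencoder encode step: one "feature" per
# dictionary row; keep only the k strongest ReLU activations.
# Dictionary and input values here are made up for illustration.

def encode_topk(x, dictionary, k=2):
    acts = [sum(w * xi for w, xi in zip(row, x)) for row in dictionary]
    acts = [max(a, 0.0) for a in acts]                      # ReLU
    top = sorted(range(len(acts)), key=lambda i: acts[i],
                 reverse=True)[:k]
    return [a if i in top else 0.0 for i, a in enumerate(acts)]

dictionary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
x = [2.0, 1.5]
print(encode_topk(x, dictionary, k=2))  # → [2.0, 0.0, 3.5, 0.0]
```

The surviving nonzero entries are the "features" active on this input; at scale, each dictionary row often corresponds to a human-interpretable concept.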
“Dragonfly: A large vision-language model with multi-resolution zoom” - Dragonfly is an instruction-tuning vision-language architecture that enhances fine-grained visual understanding and reasoning about image regions using multi-resolution zoom-and-select. Alongside the architecture, the authors released two open-source models: Llama-3-8b-Dragonfly-v1 for general domains and Llama-3-8b-Dragonfly-Med-v1 for biomedical data, both demonstrating superior performance on vision-language benchmarks and medical imaging tasks.
“MetaMixer Is All You Need” - Recent research suggests that Feed-Forward Networks (FFNs), akin to the query-key-value mechanism within self-attention, function as memory networks. To validate this, a new approach called FFNification converts self-attention into an FFN-like token mixer, showing remarkable performance improvements across various tasks. This leads to the introduction of MetaMixer, a general mixer architecture using simple operations like convolution and GELU to achieve superior performance without specifying sub-operations within the query-key-value framework.
“ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization” - Visual Geo-localization (VG) involves identifying locations depicted in query images, crucial for applications like robotics, autonomous driving, and augmented reality. To address challenges in fine-grained images lacking specific text descriptions, the authors proposed a two-stage training method leveraging CLIP's multi-modal capability to generate vague descriptions and dynamic text prompts to enhance visual features, achieving competitive results on large-scale VG datasets.
“ReLUs Are Sufficient for Learning Implicit Neural Representations” - This paper revisits the use of ReLU activation functions in learning implicit neural representations, inspired by theoretical insights and second-order B-spline wavelets. By incorporating constraints to remedy spectral bias, ReLU neurons in deep neural networks can achieve state-of-the-art performance in various INR tasks, providing a principled approach to hyperparameter selection and demonstrating effectiveness across signal representation, super-resolution, and computed tomography experiments.
“To Believe or Not to Believe Your LLM” - This study investigates uncertainty quantification in large language models (LLMs) to detect unreliable responses given a query, considering both epistemic and aleatoric uncertainties. By deriving an information-theoretic metric, the model reliably identifies instances of high epistemic uncertainty, crucial for detecting hallucinations in both single- and multi-answer responses, a capability not present in many standard uncertainty quantification strategies. Experimental results demonstrate the effectiveness of this approach, shedding light on the amplification of probabilities assigned to outputs by LLMs through iterative prompting.
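A common way to separate the two uncertainty types, shown here as an illustration rather than the paper's exact information-theoretic metric, is the entropy decomposition over repeated samples: total predictive entropy splits into an average per-sample entropy (aleatoric) plus a mutual-information term (epistemic):

```python
import math

# Entropy decomposition over sampled model "beliefs". Each row is
# one sample's probability distribution over candidate answers.
# Illustrative only; not the paper's exact metric.

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose(dists):
    k = len(dists[0])
    mean = [sum(d[i] for d in dists) / len(dists) for i in range(k)]
    total = entropy(mean)                       # predictive entropy
    aleatoric = sum(entropy(d) for d in dists) / len(dists)
    epistemic = total - aleatoric               # mutual information
    return total, aleatoric, epistemic

# Samples agree -> epistemic term near zero.
print(decompose([[0.9, 0.1], [0.9, 0.1]]))
# Samples disagree confidently -> epistemic term dominates.
print(decompose([[0.99, 0.01], [0.01, 0.99]]))
```

High epistemic uncertainty (confident samples that disagree) is exactly the signal used to flag likely hallucinations.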
“XRec: Large Language Models for Explainable Recommendation” - Recommender systems alleviate information overload by tailoring recommendations to user preferences, with Explainable recommendations aiming to enhance transparency in decision-making. Leveraging the language capabilities of LLMs, this work introduces XRec, a model-agnostic framework that empowers LLMs to provide comprehensive explanations for recommendations, integrating collaborative signals to understand complex user-item interactions and outperforming baseline approaches in explainable recommender systems through extensive experiments.
“GrootVL: Tree Topology is All You Need in State Space Model” - The GrootVL network addresses the limitations of state space models by dynamically generating a tree topology based on spatial relationships and input features, enabling stronger representation capabilities by breaking sequence constraints. Additionally, a linear complexity dynamic programming algorithm enhances long-range interactions without increasing computational cost, resulting in significant performance improvements across visual tasks like image classification, object detection, and segmentation, as well as textual tasks with minor training cost.
https://github.com/coniferlm/conifer - Code for reproducing the Conifer dataset from the paper “Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models”. Models trained with the Conifer dataset show remarkable improvements in instruction following.
https://github.com/lichao-sun/mora - Multi-agent framework designed to facilitate generalist video generation tasks, leveraging a collaborative approach with multiple visual agents. It aims to replicate and extend the capabilities of OpenAI's Sora. Works on many tasks like: text-to-video generation, video-to-video editing, connecting videos and others.
https://github.com/onuratakan/gpt-computer-assistant - A GPT-4o desktop assistant for Windows, macOS, and Ubuntu; an alternative to the official app, which is available only on macOS.
https://github.com/lllyasviel/Omost - Omost is a project to convert an LLM's coding capability into image generation (or, more accurately, image composition) capability.
https://github.com/ridgerchu/matmulfreellm - A language model that eliminates the need for matrix multiplication, compatible with HuggingFace transformers.
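The core trick behind matmul-free language modeling is constraining weights to ternary values, so a "matrix multiply" collapses into additions and subtractions of activations. A toy sketch of that principle (not the repo's actual kernels):

```python
# Toy matmul-free matrix-vector product: with weights in {-1, 0, +1},
# each output is a sum/difference of inputs -- no multiplications.

def ternary_matvec(W, x):
    out = []
    for row in W:                 # each entry of W is -1, 0, or 1
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi         # addition instead of w * xi
            elif w == -1:
                acc -= xi
        out.append(acc)
    return out

W = [[1, -1, 0],
     [0, 1, 1]]
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))  # → [-1.0, 8.0]
```

On hardware, replacing multipliers with adders is what makes this formulation attractive for energy-efficient inference.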
LitGPT is a command-line tool designed to easily finetune, pretrain, evaluate, and deploy 20+ LLMs on your own data. It features highly optimized training recipes for the world's most powerful open-source LLMs.
Mobius is a diffusion model that pushes the boundaries of domain-agnostic debiasing and representation realignment. By employing a brand new constructive deconstruction framework, Mobius achieves unrivaled generalization across a vast array of styles and domains, eliminating the need for expensive pretraining from scratch.
RMBG v1.4 is a state-of-the-art background removal model, designed to effectively separate foreground from background in a range of categories and image types. This model has been trained on a carefully selected dataset, which includes: general stock images, e-commerce, gaming, and advertising content, making it suitable for commercial use cases powering enterprise content creation at scale. The accuracy, efficiency, and versatility currently rival leading source-available models.
Qwen2 released: Pretrained and instruction-tuned models of 5 sizes, trained on data in 27 additional languages (besides English and Chinese), state-of-the-art performance in a large number of benchmark evaluations, significantly improved performance in coding and mathematics, extended context length support up to 128K tokens with Qwen2-7B-Instruct and Qwen2-72B-Instruct.
AI is Cracking a Hard Problem – Giving Computers a Sense of Smell - Researchers at the University of Michigan are using machine learning to create digital olfaction, enabling computers to classify smells by mapping molecular structures to odor descriptions, opening new possibilities in fields like perfumery, chemical sensing, and disease diagnostics.
Introducing Aurora: The First Large-Scale Foundation Model of the Atmosphere - Microsoft Research introduces Aurora, an advanced AI model for atmospheric forecasting that significantly enhances the accuracy of extreme weather predictions using a 3D Swin Transformer architecture and extensive training on diverse atmospheric data.
Using AI to Decode Dog Vocalizations - Researchers at the University of Michigan have developed an AI tool to classify dog barks using human speech processing models, which may enhance our understanding of animal communication and improve dog welfare.
Scaling Neural Machine Translation to 200 Languages - The NLLB Team introduces a neural machine translation model supporting 200 languages using the Sparsely Gated Mixture of Experts architecture, achieving a 44% improvement in translation quality over previous models and promoting language equity.
BrightEdge Releases Post Google I/O Data on The Impact of AI Overviews - BrightEdge reveals that AI Overviews in Google search are now less visible, most often triggered by Featured Snippets and questions, with their presence varying significantly by industry, while AI response precision has increased.
Code Transformation: Experimental Code Editing Capabilities - Code Transformation is Google's experimental model that allows editing existing Python code based on code context and natural language instructions, generating code diffs for tasks like adding docstrings, reducing nesting, cleaning up code, and fixing errors.
Introducing Stable Audio Open: An Open Source Model for Audio - Stable Audio Open is an open-source text-to-audio model that generates up to 47 seconds of high-quality audio samples from text prompts, aimed at creating sound effects, foley recordings, and production elements.
AI in Software Engineering at Google: Progress and the Path Ahead - Google discusses the latest AI advancements in software development tools, addressing challenges, methodologies, and future directions aimed at enhancing developer productivity and satisfaction.
HumanEvalPack: Extended Dataset for Code Evaluation - HumanEvalPack is an extension of OpenAI's HumanEval dataset, covering 6 programming languages and 3 tasks, used to evaluate code model performance across various scenarios.
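Code benchmarks in the HumanEval family are typically scored with pass@k; the standard unbiased estimator (introduced with the original HumanEval by Chen et al., 2021) is straightforward to compute:

```python
from math import comb

# Unbiased pass@k estimator: given n samples per problem of which
# c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), i.e. the
# probability that at least one of k drawn samples passes.

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: chance that 1 randomly drawn sample passes.
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3
```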
FineWeb-Edu: The Finest Collection of Educational Web Content - FineWeb-Edu is a dataset comprising 1.3 trillion tokens of high-quality educational web pages, filtered from the larger FineWeb dataset using an educational quality classifier trained on annotations generated by the Llama3-70B-Instruct model.
ImageInWords: Unlocking Hyper-Detailed Image Descriptions - ImageInWords is a carefully designed human-in-the-loop annotation framework for creating hyper-detailed image descriptions, featuring both human and machine-generated annotations to promote the development of automatic metrics for evaluating image descriptions.
What We’ve Learned From a Year of Building with LLMs: A Practical Guide - A practical guide to building successful products with large language models (LLMs), covering tactical, operational, and strategic approaches based on the authors’ experiences over the past year.