News
Entertainment
Science & Technology
Life
Culture & Art
Hobbies
Process-supervised reward models (PRMs) offer fine-grained, step-wise feedback on model responses, aiding in selecting effective reasoning paths for complex tasks. Unlike output reward models (ORMs), which evaluate responses based on final outputs, PRMs provide detailed assessments at each step, making them particularly valuable for reasoning-intensive applications. While PRMs have been extensively studied in language tasks, their application in multimodal settings remains largely unexplored. Most vision-language reward models still rely on the ORM approach, highlighting the need for further research into how PRMs can enhance multimodal learning and reasoning. Existing reward benchmarks primarily focus on text-based models, with some specifically designed
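The contrast between the two scoring styles can be sketched with a toy example. Everything here is an assumption for illustration: the candidate paths, the scores, and the choice to aggregate step scores with a minimum are invented and are not taken from any particular reward model.

```python
# Toy illustration only: the scores below are invented, not model outputs.

def orm_score(final_answer_score: float) -> float:
    """An output reward model judges only the final output."""
    return final_answer_score


def prm_score(step_scores: list) -> float:
    """A process reward model scores every step. Aggregating with min is an
    assumed choice: a single weak step drags the whole path down."""
    return min(step_scores)


# Two hypothetical reasoning paths for the same question.
candidates = {
    "path_a": {"steps": [0.9, 0.2, 0.9], "final": 0.8},
    "path_b": {"steps": [0.8, 0.7, 0.8], "final": 0.7},
}

best_by_orm = max(candidates, key=lambda name: orm_score(candidates[name]["final"]))
best_by_prm = max(candidates, key=lambda name: prm_score(candidates[name]["steps"]))
```

In this made-up case the output-only score prefers `path_a`, while the step-wise score prefers `path_b` because `path_a` contains one poorly rated step.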
The advancement of large language models (LLMs) has significantly influenced interactive technologies, presenting both benefits and challenges. One prominent issue arising from these models is their potential to generate harmful content. Traditional moderation systems, typically employing binary classifications (safe vs. unsafe), lack the necessary granularity to distinguish varying levels of harmfulness effectively. This limitation can lead to either excessively restrictive moderation, diminishing user interaction, or inadequate filtering, which could expose users to harmful content. Salesforce AI introduces BingoGuard, an LLM-based moderation system designed to address the inadequacies of binary classification by predicting both binary safety labels and detailed severity levels.
LLMs have significantly advanced NLP, demonstrating strong text generation, comprehension, and reasoning capabilities. These models have been successfully applied across various domains, including education, intelligent decision-making, and gaming. In education, LLMs serve as interactive tutors, aiding personalized learning and improving students’ reading and writing skills. In decision-making, they analyze large datasets to generate insights for complex problems. In gaming, they enhance player experiences by generating dynamic content and facilitating strategy development. However, despite these successes, their application to intricate tasks such as strategic gameplay in Gomoku remains challenging. Gomoku, a classic board game known for its simple rules yet deep
The rapid progress in artificial intelligence (AI) and machine learning (ML) research underscores the importance of accurately evaluating AI agents' capabilities in replicating complex, empirical research tasks traditionally performed by human researchers. Currently, systematic evaluation tools that precisely measure the ability of AI agents to autonomously reproduce ML research findings remain limited, posing challenges in fully understanding the potential and limitations of such systems. OpenAI has introduced PaperBench, a benchmark designed to evaluate the competence of AI agents in autonomously replicating state-of-the-art machine learning research. PaperBench specifically measures whether AI systems can accurately interpret research papers, independently develop the necessary
Hallucination remains a significant challenge in deploying Large Vision-Language Models (LVLMs), as these models often generate text misaligned with visual inputs. Unlike hallucination in LLMs, which arises from linguistic inconsistencies, LVLMs struggle with cross-modal discrepancies, leading to inaccurate image descriptions or incorrect spatial relationships. These models leverage vision encoders, such as CLIP, alongside pretrained text decoders to map visual information into language. Despite their strong performance in tasks like image captioning, visual question answering, and medical treatment planning, LVLMs remain prone to hallucination, which limits their real-world applicability. The issue stems from various factors, including statistical biases in pretraining, an
Nomic has announced the release of 'Nomic Embed Multimodal,' a groundbreaking embedding model that achieves state-of-the-art performance on visual document retrieval tasks. The new model seamlessly processes interleaved text, images, and screenshots, establishing a new high score on the Vidore-v2 benchmark for visual document retrieval. This advancement is particularly significant for retrieval augmented generation (RAG) applications working with PDF documents, where capturing both visual and textual context is crucial.
Breaking New Ground in Visual Document Retrieval
The Nomic Embed Multimodal 7B model has achieved an impressive 62.7 NDCG@5 score on the Vidore-v2 benchmark, representing a 2.8-point improvement over previous best-performing
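For readers unfamiliar with the metric, here is a minimal sketch of how NDCG@k is commonly computed. It assumes the linear-gain variant and builds the ideal ranking from the retrieved list alone; the benchmark's own evaluation code may differ, and reported scores such as 62.7 are presumably this value scaled by 100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the results in the order the retriever returned them.
perfect = ndcg_at_k([1, 0, 0, 0, 0], 5)    # relevant document ranked first
second = ndcg_at_k([0, 1, 0, 0, 0], 5)     # relevant document ranked second
```

Moving the only relevant document from rank 1 to rank 2 lowers the score from 1.0 to 1/log2(3), which is why the metric rewards placing relevant pages near the top.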
Text-to-SQL translation, the task of transforming natural language queries into structured SQL statements, is essential for facilitating user-friendly database interactions. However, the task involves significant complexities, notably schema linking, handling compositional SQL syntax, and resolving ambiguities in user queries. While Large Language Models (LLMs) have shown robust capabilities across various domains, the efficacy of structured reasoning techniques such as Chain-of-Thought (CoT) within text-to-SQL contexts remains limited. Prior attempts employing zero-shot CoT or Direct Preference Optimization (DPO) without structured reasoning yielded marginal improvements, indicating the necessity for more rigorous methodologies. Snowflake introduces ExCoT, a structured framework designed to optimize open-source LLMs
Large Language Models (LLMs) significantly benefit from attention mechanisms, enabling the effective retrieval of contextual information. Nevertheless, traditional attention methods primarily depend on single token attention, where each attention weight is computed from a single pair of query and key vectors. This design inherently constrains the model's ability to discern contexts requiring the integration of multiple token signals, thereby limiting its effectiveness on complex linguistic dependencies. For example, identifying sentences simultaneously containing both 'Alice' and 'rabbit' is challenging because conventional attention mechanisms struggle to integrate multiple separate attention signals efficiently without substantially increasing model complexity. Meta AI addresses this limitation
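The "single token attention" described above can be sketched in a few lines of plain Python. This is a generic scaled dot-product softmax with made-up two-dimensional vectors, not the method Meta AI proposes; it only shows that each logit is computed from one query vector and one key vector.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products; each logit uses one query-key pair."""
    dim = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative vectors only.
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because every weight depends on a single query-key similarity, no individual weight can express a condition such as "this context mentions both 'Alice' and 'rabbit'".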
Deploying LLMs presents challenges, particularly in optimizing efficiency, managing computational costs, and ensuring high-quality performance. LLM routing has emerged as a strategic solution to these challenges, enabling intelligent task allocation to the most suitable models or tools. Let’s delve into the intricacies of LLM routing, explore various tools and frameworks designed for its implementation, and examine academic perspectives on the subject.
Understanding LLM Routing
LLM routing is the process of examining incoming queries or tasks and directing them to the best-suited language model or collection of models in a system. This guarantees that every task is handled by the optimal
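As a rough illustration of the idea, a router can be as simple as a function that inspects the query and returns a model name. The model names, keywords, and length threshold below are hypothetical; production routers typically rely on learned classifiers or cost models rather than keyword rules.

```python
# Hypothetical model names and a deliberately crude heuristic.
REASONING_HINTS = ("prove", "derive", "step by step")


def route(query: str) -> str:
    """Pick a model for the query based on simple surface features."""
    text = query.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return "large-reasoning-model"
    if len(text.split()) > 50:
        return "long-context-model"
    return "small-fast-model"
```

Calling `route("Prove that the sum of two even numbers is even")` selects the reasoning model, while a short greeting falls through to the cheapest option.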
The terminal (on Mac/Linux) or command prompt (on Windows) is a powerful tool that allows you to interact with your computer using text commands instead of clicking through a graphical interface. While it might seem intimidating at first, mastering basic terminal commands can help you:
Navigate through files and folders more efficiently
Perform tasks that aren't possible through the regular interface
Automate repetitive tasks
Gain a deeper understanding of how your computer works
This guide will introduce you to the essential commands and concepts to get you started, regardless of which operating system you use.
Getting Started
Opening the Terminal
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning LLMs with human values and preferences. Despite the introduction of non-RL alternatives like DPO, industry-leading models such as ChatGPT/GPT-4, Claude, and Gemini continue to rely on RL algorithms like PPO for policy optimization. Recent research focuses on algorithmic improvements, including eliminating critic models to reduce computational costs, filtering noisy samples during PPO sampling, and enhancing reward models to mitigate reward hacking problems. However, only a few studies focus on RLHF data construction (i.e., training prompts) and its performance scaling based on these training prompts. The success of RLHF heavily depends on reward
The Transformer architecture revolutionised natural language processing with its self-attention mechanism, enabling parallel computation and effective context retrieval. However, Transformers face significant limitations when processing longer sequences due to their quadratic computational complexity. Linear Recurrent Neural Networks (RNNs) have emerged as a promising alternative, offering parallel training capabilities while maintaining linear inference-time complexity. The expressivity of these models depends fundamentally on their state-transition matrices. The evolution of linear RNNs has progressed from early models with token-independent state-transition matrices to more powerful token-dependent designs. The field has further advanced with non-diagonal structures that allow simultaneous mixing of information across both tokens
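The difference between token-independent and token-dependent state transitions can be sketched with a diagonal (elementwise) recurrence. The update rule `h_t = a_t * h_{t-1} + x_t` and all numbers below are simplifying assumptions for illustration; real linear RNNs add input and output projections and, in newer designs, non-diagonal transition matrices.

```python
def linear_recurrence(inputs, transitions):
    """Run h_t = a_t * h_{t-1} + x_t elementwise (a diagonal transition matrix)."""
    state = [0.0] * len(inputs[0])
    states = []
    for x_t, a_t in zip(inputs, transitions):
        state = [a * h + x for a, h, x in zip(a_t, state, x_t)]
        states.append(state)
    return states


inputs = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# Token-independent: the same decay is applied at every step.
fixed = linear_recurrence(inputs, [[0.5, 0.5]] * 3)

# Token-dependent: the second token resets channel 0 and fully keeps channel 1.
gated = linear_recurrence(inputs, [[0.5, 0.5], [0.0, 1.0], [0.5, 0.5]])
```

With a fixed transition the model can only decay its state at a constant rate; letting `a_t` vary per token allows it to forget or retain information selectively.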
Amazon has revealed a new artificial intelligence (AI) model called Amazon Nova Act. This AI agent is designed to operate and take actions within a web browser, automating tasks like filling out forms, navigating interfaces, and handling popups. Think of it as an assistant working directly on websites. Amazon has also released the Nova Act SDK, which lets developers experiment with the technology. Developers can create agents to handle simple online tasks.
Current Status of AI Agents
AI agents mostly talk or find information, responding in natural language or searching knowledge bases. Amazon says it envisions AI agents being able
Large language models (LLMs) have demonstrated significant progress across various tasks, particularly in reasoning capabilities. However, effectively integrating reasoning processes with external search operations remains challenging, especially for multi-hop questions requiring intricate reasoning chains and multiple retrieval steps. Current methods primarily depend on manually designed prompts or heuristics, posing limitations in scalability and flexibility. Additionally, generating supervised data for multi-step reasoning scenarios is often prohibitively expensive and practically infeasible. Researchers from Baichuan Inc., Tongji University, The University of Edinburgh, and Zhejiang University introduce ReSearch, a novel AI framework designed to train LLMs to integrate reasoning with search via reinforcement learning,
In this tutorial, we demonstrate how to build a prototype X-ray judgment tool using open-source libraries in Google Colab. By leveraging the power of TorchXRayVision for loading pre-trained DenseNet models and Gradio for creating an interactive user interface, we show how to process and classify chest X-ray images with minimal setup. This notebook guides you through image preprocessing, model inference, and result interpretation, all designed to run seamlessly on Colab without requiring external API keys or logins. Please note that this demo is intended for educational purposes only and should not be used as a substitute for professional clinical diagnosis.
Creative writing is a domain that thrives on diversity and imagination. Unlike fact-based or task-specific writing, where a single correct output may exist, creative writing involves numerous valid responses to a prompt. Stories, poems, and narratives can branch in countless directions, each with its own stylistic flavor and meaning. This inherent open-endedness makes creative writing a prime challenge for AI systems, which need to maintain narrative coherence while producing novel and distinct outputs. The core issue lies in how large language models are refined after their initial training. Post-training methods often emphasize quality improvements by aligning responses with user preferences or maximizing
Table of contents: Introduction, Installation, Windows, macOS, Linux, Verifying Installation, Git Bash Basics, Navigation Commands, File Operations, Keyboard Shortcuts, Git Configuration, Additional Configurations, Basic Git Workflow, Initializing a Repository, Checking Status, Staging Files, Committing Changes, Branching and Merging, Working with Branches, Merging Branches, Handling Merge Conflicts, Deleting Branches, Remote Repositories, Adding a Remote Repository, Advanced Git Commands, Stashing Changes, Reverting Changes, Interactive Rebase, Troubleshooting, Common Issues and Solutions, Git Best Practices, .gitignore Example, Conclusion
Introduction
Git is a distributed version control system that helps you track changes in your code, collaborate with others, and maintain a history of your project. Git Bash is a terminal application for Windows that provides a Unix-like command-line experience for using Git. This guide will walk you through setting up Git, using Git Bash, and mastering essential Git
LLMs have shown impressive capabilities in reasoning tasks like Chain-of-Thought (CoT), enhancing accuracy and interpretability in complex problem-solving. While researchers are extending these capabilities to multi-modal domains, videos present unique challenges due to their temporal dimension. Unlike static images, videos require understanding dynamic interactions over time. Current visual CoT methods excel with static inputs but struggle with video content because they cannot explicitly localize or revisit specific moments in sequences. Humans overcome these challenges by breaking down complex problems, identifying and revisiting key moments, and synthesizing observations into coherent answers. This approach highlights the need for AI systems to manage
Approximate Nearest Neighbor Search (ANNS) is a fundamental vector search technique that efficiently identifies similar items in high-dimensional vector spaces. Traditionally, ANNS has served as the backbone for retrieval engines and recommendation systems; however, it struggles to keep pace with modern Transformer architectures that employ higher-dimensional embeddings and larger datasets. Unlike deep learning systems that can be horizontally scaled due to their stateless nature, ANNS remains centralized, creating a severe single-machine throughput bottleneck. Empirical testing with 100-million scale datasets reveals that even state-of-the-art CPU implementations of the Hierarchical Navigable Small World (HNSW) algorithm can’t maintain adequate performance as vector dimensions
In this tutorial, we demonstrate how to evaluate the quality of LLM-generated responses using Atla's Python SDK, a powerful tool for automating evaluation workflows with natural language criteria. Powered by Selene, Atla's state-of-the-art evaluator model, we analyze whether legal responses align with the principles of the GDPR (General Data Protection Regulation). Atla's platform enables programmatic assessments using custom or predefined criteria with synchronous and asynchronous support via the official Atla SDK. In this implementation, we did the following:
Used custom GDPR evaluation logic
Queried Selene to return binary scores (0 or 1) and human-readable critiques
Processed the evaluation in batch
In the evolving landscape of web development, the emergence of no-code platforms has significantly broadened access to application creation. Among these, Hostinger Horizons stands out as an AI-powered tool designed to facilitate the building, editing, and publishing of custom web applications without necessitating any coding expertise. By integrating essential services such as hosting, domain registration, and email functionalities, Hostinger Horizons offers a comprehensive solution for individuals and businesses seeking to establish a digital presence.
Technical Overview
Hostinger Horizons utilizes advanced artificial intelligence and natural language processing to interpret user inputs and generate functional web applications. The platform features a user-friendly
AI agent memory comprises multiple layers, each serving a distinct role in shaping the agent’s behavior and decision-making. Dividing memory into different types makes it easier to understand and design AI systems that are both contextually aware and responsive. Let’s explore the four key types of memory commonly used in AI agents: Episodic, Semantic, Procedural, and Short-Term (or Working) Memory, along with the interplay between long-term and short-term storage.
1. Episodic Memory: Recalling Past Interactions
Episodic memory in AI refers to the storage of past interactions and the specific actions taken by the agent. Like human memory,
Recent advancements in AI scaling laws have shifted from merely increasing model size and training data to optimizing inference-time computation. This approach, exemplified by models like OpenAI o1 and DeepSeek R1, enhances model performance by leveraging additional computational resources during inference. Test-time budget forcing has emerged as an efficient technique in LLMs, enabling improved performance with minimal token sampling. Similarly, inference-time scaling has gained traction in diffusion models, particularly in reward-based sampling, where iterative refinement helps generate outputs that better align with user preferences. This method is crucial for text-to-image generation, where naïve sampling often fails to fully capture intricate
Reinforcement Learning from Verifiable Rewards (RLVR) has recently emerged as a promising method for enhancing reasoning abilities in language models without direct supervision. This approach has shown notable success in mathematics and coding, where reasoning naturally aligns with structured problem-solving. While studies have demonstrated that RLVR alone can lead to self-evolved reasoning, research has largely been limited to these technical fields. Efforts to extend RLVR have explored synthetic datasets, such as those involving sequential tasks and object counting, indicating potential but also highlighting the challenges of adapting this method to different domains. Expanding RLVR to broader areas remains an open
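What makes a reward "verifiable" is that it can be computed by a rule rather than by a learned judge. A minimal sketch for a math-style task follows; the function name and the convention of checking the last number in the response are assumptions made for illustration, not a description of any specific RLVR pipeline.

```python
import re


def verifiable_reward(response: str, reference: str) -> float:
    """Return 1.0 if the last number in the response equals the reference, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if numbers and float(numbers[-1]) == float(reference):
        return 1.0
    return 0.0
```

Such a check is straightforward in mathematics and coding, where answers can be compared or tests can be run, which helps explain why extending the approach to open-ended domains is harder.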
Large language models (LLMs) have become vital across domains, enabling high-performance applications such as natural language generation, scientific research, and conversational agents. Underneath these advancements lies the transformer architecture, where alternating layers of attention mechanisms and feed-forward networks (FFNs) sequentially process tokenized input. However, with an increase in size and complexity, the computational burden required for inference grows substantially, creating an efficiency bottleneck. Efficient inference is now a critical concern, with many research groups focusing on strategies that can reduce latency, increase throughput, and cut computational costs while maintaining or improving model performance. At the center of this efficiency problem
Visual Studio Code (VSCode) is a powerful, free source-code editor that makes it easy to write and run Python code. This guide will walk you through setting up VSCode for Python development, step by step.
Prerequisites
Before we begin, make sure you have:
Python installed on your computer
An internet connection
Basic familiarity with your computer's operating system
Step 1: Download and Install Visual Studio Code
Windows, macOS, and Linux
Go to the official VSCode website: https://code.visualstudio.com/
Click the 'Download' button for your operating system
Run the installer and follow the installation instructions
Launch VSCode after installation
Step 2: Install
Large language models struggle to process and reason over lengthy, complex texts without losing essential context. Traditional models often suffer from context loss, inefficient handling of long-range dependencies, and difficulties aligning with human preferences, affecting the accuracy and efficiency of their responses. Tencent’s Hunyuan-T1 directly tackles these challenges by integrating a novel Mamba-powered architecture with advanced reinforcement learning and curriculum strategies, ensuring robust context capture and enhanced reasoning capabilities. Tencent describes Hunyuan-T1 as the first ultra-large model built on a hybrid Transformer-Mamba architecture combined with Mixture-of-Experts (MoE). Built on the TurboS fast-thinking base, Hunyuan-T1 is specifically engineered
Supervised fine-tuning (SFT) is the standard training paradigm for large language models (LLMs) and graphical user interface (GUI) agents. However, SFT demands high-quality labeled datasets, resulting in extended training periods and high computational expenses. This dependence on extensive data creates bottlenecks in AI development workflows. Moreover, existing VLM-based GUI agents trained through SFT show performance deficiencies when confronted with out-of-domain scenarios, severely limiting their practical utility in diverse real-world applications. Rule-based reinforcement learning (RL) or reinforcement fine-tuning (RFT) is a promising alternative, requiring only dozens to thousands of samples instead of massive datasets. Various approaches have been developed to advance
In this tutorial, we explore an innovative approach that blends deep learning with physical laws by leveraging Physics-Informed Neural Networks (PINNs) to solve the one-dimensional Burgers’ equation. Using PyTorch on Google Colab, we demonstrate how to encode the governing differential equation directly into the neural network’s loss function, allowing the model to learn the solution 𝑢(𝑥,𝑡) that inherently respects the underlying physics. This technique reduces the reliance on large labeled datasets and offers a fresh perspective on solving complex, non-linear partial differential equations using modern computational tools.
!pip install torch matplotlib
First, we install the PyTorch
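For reference, the viscous Burgers' equation and a typical PINN loss can be written as follows. The viscosity symbol, the network notation, the collocation and data point counts, and the split of the loss into a residual term and a boundary/initial data term are standard conventions assumed here, not details taken from the tutorial.

```latex
\begin{align}
  \frac{\partial u}{\partial t} + u\,\frac{\partial u}{\partial x}
    &= \nu\,\frac{\partial^2 u}{\partial x^2}, \\
  r_\theta(x,t)
    &= \partial_t u_\theta + u_\theta\,\partial_x u_\theta - \nu\,\partial_{xx} u_\theta, \\
  \mathcal{L}(\theta)
    &= \frac{1}{N_r}\sum_{i=1}^{N_r} r_\theta(x_i,t_i)^2
     + \frac{1}{N_b}\sum_{j=1}^{N_b}\bigl(u_\theta(x_j,t_j) - u_j\bigr)^2 .
\end{align}
```

Here $u_\theta$ is the network's approximation of the solution, $r_\theta$ is the equation residual evaluated at $N_r$ collocation points, and the second sum penalizes mismatch with the $N_b$ known initial and boundary values $u_j$. The derivatives in $r_\theta$ are usually obtained by automatic differentiation.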
3D self-supervised learning (SSL) has faced persistent challenges in developing semantically meaningful point representations suitable for diverse applications with minimal supervision. Despite substantial progress in image-based SSL, existing point cloud SSL methods have largely been limited due to the issue known as the 'geometric shortcut,' where models excessively rely on low-level geometric features like surface normals or point heights. This reliance compromises the generalizability and semantic depth of the representations, hindering their practical deployment. Researchers from the University of Hong Kong and Meta Reality Labs Research introduce Sonata, an advanced approach designed to address these fundamental challenges. Sonata employs a
Large vision-language models (LVLMs) integrate large language models with image processing capabilities, enabling them to interpret images and generate coherent textual responses. While they excel at recognizing visual objects and responding to prompts, they often falter when presented with problems requiring multi-step reasoning. Vision-language tasks like understanding charts, solving visual math questions, or interpreting diagrams demand more than recognition; they need the ability to follow logical steps based on visual cues. Despite advancements in model architecture, current systems consistently struggle to produce accurate and interpretable answers in such complex scenarios. A major limitation in current vision-language models is their inability
In this tutorial, we demonstrate the integration of Python’s robust data manipulation library Pandas with Google Cloud’s advanced generative capabilities through the google.generativeai package and the Gemini Pro model. By setting up the environment with the necessary libraries, configuring the Google Cloud API key, and leveraging the IPython display functionalities, the code provides a step-by-step approach to building a data science agent analyzing a sample sales dataset. The example shows how to convert a DataFrame into markdown format and then use natural language queries to generate insights about the data, highlighting the potential of combining traditional data analysis tools with
Time series analysis faces significant hurdles in data availability, quality, and diversity, critical factors in developing effective foundation models. Real-world datasets often fall short due to regulatory limitations, inherent biases, poor quality, and limited paired textual annotations, making it difficult to create robust, generalizable Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs). This scarcity impacts tasks such as forecasting, classification, anomaly detection, reasoning, and captioning, limiting the full potential of current advancements in artificial intelligence. Salesforce AI Research has addressed these challenges by proposing a comprehensive approach to leveraging synthetic data for enhancing TSFMs and
Developing therapeutics continues to be an inherently costly and challenging endeavor, characterized by high failure rates and prolonged development timelines. The traditional drug discovery process necessitates extensive experimental validations from initial target identification to late-stage clinical trials, consuming substantial resources and time. Computational methodologies, particularly machine learning and predictive modeling, have emerged as pivotal tools to streamline this process. However, existing computational models are typically highly specialized, limiting their effectiveness in addressing diverse therapeutic tasks and offering limited interactive reasoning capabilities required for scientific inquiry and analysis. To address these limitations, Google AI has introduced TxGemma, a collection of generalist
Monocular depth estimation involves predicting scene depth from a single RGB image, a fundamental task in computer vision with wide-ranging applications, including augmented reality, robotics, and 3D scene understanding. In this tutorial, we implement Intel’s MiDaS, a state-of-the-art monocular depth estimation model designed for high-quality depth prediction from a single image. Leveraging Google Colab as the compute platform, along with PyTorch, OpenCV, and Matplotlib, this tutorial enables you to upload your image and visualize the corresponding depth maps easily.
!pip install -q timm opencv-python matplotlib
First, we install the necessary Python libraries: timm
Autoregressive visual generation models have emerged as a groundbreaking approach to image synthesis, drawing inspiration from language model token prediction mechanisms. These models use image tokenizers to transform visual content into discrete or continuous tokens. The approach facilitates flexible multimodal integration and allows adaptation of architectural innovations from LLM research. However, the field faces a critical challenge: determining the optimal token representation strategy. The choice between discrete and continuous token representations remains a fundamental dilemma, significantly impacting model complexity and generation quality. Existing visual tokenization methods follow two primary approaches: continuous and discrete token representations. Variational
The rapid advancements in search engine technologies integrated with large language models (LLMs) have predominantly favored proprietary solutions such as OpenAI's GPT-4o Search Preview and Perplexity's Sonar Reasoning Pro. While these proprietary systems offer strong performance, their closed-source nature poses significant challenges, particularly concerning transparency, innovation, and community collaboration. This exclusivity limits customization and hampers broader academic and entrepreneurial engagement with search-enhanced AI. In response to these limitations, researchers from the University of Washington, Princeton University, and UC Berkeley have introduced Open Deep Search (ODS), an open-source search AI framework designed for seamless integration with any user-selected LLM in a modular
Compression is a cornerstone of computational intelligence, deeply rooted in the theory of Kolmogorov complexity, which measures a sequence by the length of the shortest program that reproduces it. Unlike traditional compression methods that look for repetition and redundancy, Kolmogorov’s framework interprets compression as a problem of discovering structured patterns through programmatic representation. While the theory promises optimal compression, its uncomputability poses a significant hurdle. Nevertheless, the emergence of large language models capable of code generation opens an intriguing opportunity to test how closely modern systems can approximate this theoretical ideal by reasoning through code rather than pattern matching. A core issue arises
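The gap between redundancy-based compression and a programmatic description can be seen in a small experiment. The sequence and the one-line generating program below are chosen purely for illustration; the program's length is only an upper bound on the sequence's Kolmogorov complexity, which itself cannot be computed.

```python
import zlib

# A structured sequence: squares modulo a prime, 2000 bytes long.
sequence = bytes((i * i) % 251 for i in range(2000))

# A short Python expression that regenerates the same bytes.
program = b"bytes((i * i) % 251 for i in range(2000))"

# General-purpose compression exploits repetition in the byte stream.
compressed = zlib.compress(sequence, 9)

# Evaluating the fixed literal above reproduces the sequence exactly.
regenerated = eval(program.decode())

sizes = {
    "raw": len(sequence),
    "zlib": len(compressed),
    "program": len(program),
}
```

zlib shrinks the data by exploiting its periodic repetition, but the generating expression is shorter still, because it captures the rule behind the data rather than its surface redundancy.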
In this guide, you will learn how to deploy a machine learning model as an API using FastAPI. We will create an API that predicts the species of a penguin based on its bill length and flipper length.
Prerequisites
Basic knowledge of Python
Python installed on your system (preferably version 3.7 or higher)
Familiarity with machine learning concepts (optional)
Step 1: Set Up Your Environment
Create a Project Directory
Open your terminal and create a new directory for your project:
Set Up a Virtual Environment
Create and activate a virtual environment:
On Windows use: venv\Scripts\activate
Install Required Packages
Install FastAPI, Uvicorn (for serving
Large Language Models (LLMs) are becoming integral to modern technology, driving agentic systems that interact dynamically with external environments. Despite their impressive capabilities, LLMs are highly vulnerable to prompt injection attacks. These attacks occur when adversaries inject malicious instructions through untrusted data sources, aiming to compromise the system by extracting sensitive data or executing harmful operations. Traditional security methods, such as model training and prompt engineering, have shown limited effectiveness, underscoring the urgent need for robust defenses. Google DeepMind Researchers propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models