Paper descriptions:

They propose a new architecture, HSTU, designed for high-cardinality, non-stationary streaming recommendation data. The authors observe a fundamental shift from specialized vertical models to unified foundation models for commerce: as models grow from 1.8M to 83M parameters, they see consistent improvements. Shopify's results with the HSTU architecture for shop recommendations:
• 200-360% lift over baseline for merchant recommendations
• 82% improvement in product recommendations
• Scales with compute, similar to language models

The paper introduces a benchmark to evaluate the temporal reasoning abilities of large language models (LLMs). Temporal reasoning involves understanding and processing sequences of events over time, which remains a challenge for LLMs. This benchmark provides structured tasks to measure progress and highlight limitations in reasoning capabilities.

The paper introduces SWT-Bench, a benchmark designed to evaluate the ability of LLM-based code agents to generate test cases from user-reported issues. The authors compiled real-world issues, corresponding bug fixes, and reference tests by analyzing popular GitHub repositories. Their findings indicate that LLMs, particularly those tailored for code repair, excel at producing relevant test cases, often surpassing specialized test generation systems. Additionally, the study demonstrates that these generated tests can effectively filter proposed code fixes, enhancing the precision of tools like SWE-Agent.
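The test-based filtering idea can be sketched as follows. This is a minimal illustration of fail-to-pass filtering in general, not SWT-Bench's actual pipeline; the function and patch names are hypothetical.

```python
from typing import Callable, List

IntFn = Callable[[int], int]

def filter_patches(
    patches: List[IntFn],
    buggy_fn: IntFn,
    test: Callable[[IntFn], bool],
) -> List[IntFn]:
    """Keep only patches accepted by the generated test, provided the
    test actually fails on the buggy code (i.e., reproduces the issue)."""
    if test(buggy_fn):
        # The test passes on the buggy code, so it cannot discriminate
        # between real fixes and non-fixes; keep everything.
        return patches
    return [p for p in patches if test(p)]

# Hypothetical issue: an absolute-value helper mishandles negative inputs.
buggy = lambda x: x                     # wrong for x < 0
good_patch = lambda x: -x if x < 0 else x
bad_patch = lambda x: x * x             # changes behavior, but incorrectly
generated_test = lambda fn: fn(-3) == 3 and fn(4) == 4

kept = filter_patches([good_patch, bad_patch], buggy, generated_test)
print(len(kept))  # prints 1: only good_patch survives the filter
```

The fail-to-pass check matters: a generated test that already passes on the buggy code carries no signal about whether a candidate patch fixes the reported issue.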

The paper introduces JKOnet*, a model that efficiently learns diffusion processes by recovering potential, interaction, and internal energy components from observational data. By minimizing a quadratic loss, JKOnet* outperforms existing methods in terms of sample efficiency, computational complexity, and accuracy. It provides a closed-form optimal solution for linearly parametrized functionals and achieves state-of-the-art accuracy in predicting cellular processes at a fraction of the computational cost of current methods.
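For context, the three energy components mentioned above are the standard decomposition used in Wasserstein gradient-flow (JKO-scheme) models of diffusion processes; this is textbook background, not the paper's specific loss:

```latex
% JKO step: each snapshot solves a Wasserstein-proximal problem with step size \tau
\[
  \rho_{t+1} \in \arg\min_{\rho}\; J(\rho) + \frac{1}{2\tau}\, W_2^2(\rho, \rho_t),
\]
% with the energy split into potential, interaction, and internal terms:
\[
  J(\rho) = \int V(x)\,\mathrm{d}\rho(x)
          + \frac{1}{2}\iint U(x - y)\,\mathrm{d}\rho(x)\,\mathrm{d}\rho(y)
          + \int f(\rho(x))\,\mathrm{d}x .
\]
```

Recovering $V$, $U$, and $f$ from observed snapshots $\rho_t$ is what the summary refers to as learning the potential, interaction, and internal energy components.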

The paper introduces the Asynchronous Perception Machine (APM), a novel architecture for test-time training (TTT). APM processes image patches independently and asynchronously, enabling semantic clustering and out-of-distribution image recognition. It scales efficiently to large datasets and empirically supports Geoffrey Hinton's GLOM concept, offering a new way to represent hierarchical input perception.
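The patch-wise independence can be illustrated generically (a hypothetical sketch, not APM's actual architecture): a network that maps a patch coordinate plus a shared latent to a per-patch feature has no cross-patch dependencies, so patches can be queried in any order, or asynchronously.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coordinate-conditioned network with random weights, for
# illustration only: (2-d patch coordinate + 32-d shared latent) -> 16-d feature.
W1 = rng.normal(size=(34, 64))
W2 = rng.normal(size=(64, 16))

def query_patch(coord_xy, latent):
    x = np.concatenate([coord_xy, latent])  # shape (34,)
    return np.tanh(x @ W1) @ W2             # shape (16,) feature for this patch

latent = rng.normal(size=32)
coords = [np.array([i / 4.0, j / 4.0]) for i in range(4) for j in range(4)]

# Query all 16 patches in order, then again in a shuffled order.
in_order = np.stack([query_patch(c, latent) for c in coords])
out_of_order = np.empty_like(in_order)
for k in rng.permutation(len(coords)):      # arbitrary query order
    out_of_order[k] = query_patch(coords[k], latent)

print(np.allclose(in_order, out_of_order))  # True: results are order-independent
```

Because each query depends only on its own coordinate and the shared latent, a subset of patches can be computed on demand, which is what makes asynchronous, independent processing possible.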

The paper analyzes the impact of future reward information on reinforcement learning (RL) agents. By measuring the performance ratio between standard RL agents and those granted partial future-reward lookahead, the study quantifies the value of such information. The authors characterize worst-case reward distributions and derive exact ratios for worst-case expectations, revealing connections to offline RL and reward-free exploration. Their findings cover scenarios ranging from immediate reward observation to full future reward visibility.
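A toy example of why lookahead has value (my own illustration, not the paper's construction): in a one-step bandit, a standard agent must commit to the arm with the best mean, while an agent that first observes the realized rewards earns the expected maximum. The distributions below are made up for illustration.

```python
from itertools import product

# Two arms with discrete reward distributions as (value, probability) pairs.
arms = [
    [(0.0, 0.5), (2.0, 0.5)],  # mean 1.0, risky
    [(1.0, 1.0)],              # mean 1.0, deterministic
]

# Standard agent: commits before rewards realize, so it earns the best mean.
standard_value = max(sum(v * p for v, p in arm) for arm in arms)

# Lookahead agent: observes realized rewards first, so it earns E[max].
lookahead_value = 0.0
for joint in product(*arms):               # all joint outcomes across arms
    prob = 1.0
    for _, p in joint:
        prob *= p
    lookahead_value += prob * max(v for v, _ in joint)

ratio = lookahead_value / standard_value
print(standard_value, lookahead_value, ratio)  # 1.0 1.5 1.5
```

Here lookahead is worth a factor of 1.5; the paper's worst-case analysis asks how large such ratios can get over all reward distributions and lookahead horizons.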