papers
Papers in reverse chronological order. In theoretical works, authors’ names are often listed in alphabetical order.
2025
- To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models
  Eran Malach, Omid Saremi, Sinead Williamson, Arwen Bradley, Aryo Lotfi, Emmanuel Abbe, Josh Susskind, and Etai Littwin
  arXiv preprint arXiv:2510.14826, 2025
State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any "truly long-form" generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.
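The fixed-size memory and linear-time generation mentioned above can be seen in a minimal scalar recurrence (an illustrative sketch, not any specific SSM from the paper): the entire history is compressed into a state `h` updated once per token.

```python
# Minimal linear state-space scan (scalar state for simplicity):
# h_t = A * h_{t-1} + B * x_t,  y_t = C * h_t.
# Memory is O(1) in sequence length; time is O(T) for T tokens.

def ssm_scan(A, B, C, xs):
    """Run the recurrence over the input sequence and return all outputs."""
    h = 0.0
    ys = []
    for x in xs:
        h = A * h + B * x   # fixed-size state absorbs the whole history
        ys.append(C * h)
    return ys

# With A = B = C = 1 the scan reduces to a running sum.
print(ssm_scan(1.0, 1.0, 1.0, [1, 2, 3]))  # [1.0, 3.0, 6.0]
```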
@article{malach2025infinity,
  title = {To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models},
  author = {Malach, Eran and Saremi, Omid and Williamson, Sinead and Bradley, Arwen and Lotfi, Aryo and Abbe, Emmanuel and Susskind, Josh and Littwin, Etai},
  journal = {arXiv preprint arXiv:2510.14826},
  year = {2025},
}
- EPFL Thesis
  Learning to Reason with Neural Networks: Principles and Methods
  Aryo Lotfi
  EPFL, 2025
This thesis focuses on understanding and improving the reasoning capabilities of neural networks. It develops theoretical results and empirical analyses to uncover reasoning potential and limitations, leveraging these insights to guide the design of improved architectures and training methods. A key challenge in reasoning tasks is achieving out-of-distribution (OOD) and length generalization, where models must go beyond simple memorization or pattern matching. OOD generalization is especially critical in reasoning, as the combinatorial nature of input spaces makes representative data sampling infeasible. To study this, we introduce the “generalization on the unseen (GOTU)” setting: a regime in which models are fully trained on one portion of the domain and tested on a completely disjoint subset. This reframes the problem of OOD generalization as one of understanding a model’s implicit bias. Focusing on Boolean function learning, we show that models such as Transformers, random feature models, and linear networks trained by (S)GD tend to learn “minimum-degree interpolators”, solutions with minimal Fourier mass on high-degree components. This provides a theoretical explanation for failures in length generalization, especially in parity tasks. Building on this insight, we propose a curriculum learning method that prioritizes Boolean inputs (samples with low Hamming weight) early in training. Empirically, this strategy reduces both sample and time complexity for learning parity functions. We further prove a separation result: in a setting where data is drawn from a mixture of sparse and dense inputs, a two-layer ReLU network trained with curriculum-based SGD learns parities of sufficiently high degree using significantly fewer optimization steps than models trained without curriculum on the same data distribution. 
Next, we study the reasoning capacity of Transformers and introduce the notion of “globality degree” of a target distribution, which measures the minimum number of tokens required in addition to the tokens histogram to correlate non-trivially with the target. We show that high-globality tasks, such as multi-step syllogistic reasoning, are not efficiently learnable by regular Transformers. However, introducing intermediate reasoning steps, via scratchpads or chain-of-thought prompting, can break the globality and make learning efficient. To improve generalization further, we propose “inductive scratchpads”, which impose a Markovian structure over reasoning steps. These not only break the globality barrier but also enable up to 6x length generalization in arithmetic tasks. Finally, we extend our study of globality and inductive scratchpads to the visual domain. We introduce tasks involving graphs, strings, mazes, and image grids that require global visual reasoning. We find that large Vision Transformer (ViT) models struggle on these tasks due to their inherent globality. To mitigate this, we propose “chain-of-sketch (CoS)” techniques, which, analogous to chain-of-thought and scratchpad methods in text, rely on intermediate visual subtasks to facilitate learning the primary task. We further introduce an inductive variant of CoS, which provides improved out-of-distribution generalization and enables learning with smaller model sizes compared to non-inductive CoS in most tasks.
@phdthesis{lotfi2025thesis,
  address = {Lausanne},
  title = {Learning to Reason with Neural Networks: Principles and Methods},
  url = {https://infoscience.epfl.ch/handle/20.500.14299/252918},
  doi = {10.5075/epfl-thesis-11426},
  school = {EPFL},
  author = {Lotfi, Aryo},
  year = {2025},
  keywords = {neural networks | reasoning | out-of-distribution generalization | Transformers | implicit bias | scratchpads | chain-of-thought | curriculum learning},
  language = {en},
}
- RL for Reasoning by Adaptively Revealing Rationales
  Mohammad Hossein Amani, Aryo Lotfi, Nicolas Mario Baldwin, Samy Bengio, Mehrdad Farajtabar, Emmanuel Abbe, and Robert West
  arXiv preprint arXiv:2506.18110, 2025
We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model’s past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality: it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable. On mathematical reasoning benchmarks (MATH, GSM8k), we find that curriculum learning enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
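The per-sample supervision schedule described above can be sketched as a small controller (class name and update rule here are our own illustrative assumptions, not the paper's exact algorithm): each sample keeps its own revealed-prefix ratio, shrunk after a success and grown after a failure.

```python
# Illustrative AdaBack-style controller: a ratio of 1.0 reveals the full
# target (SFT-like supervision); 0.0 reveals nothing (RL-like). The ratio is
# adapted per sample from the model's reward signal.

class AdaptivePrefixCurriculum:
    def __init__(self, num_samples, step=0.1):
        self.ratio = [1.0] * num_samples  # start fully supervised
        self.step = step

    def revealed_prefix(self, sample_id, target):
        """Portion of the ground-truth rationale shown to the model."""
        k = int(len(target) * self.ratio[sample_id])
        return target[:k]

    def update(self, sample_id, reward):
        """Shrink supervision after success, expand it after failure."""
        if reward > 0:
            self.ratio[sample_id] = max(0.0, self.ratio[sample_id] - self.step)
        else:
            self.ratio[sample_id] = min(1.0, self.ratio[sample_id] + self.step)


curriculum = AdaptivePrefixCurriculum(num_samples=1)
target = list("ABCDEFGHIJ")
curriculum.update(0, reward=1)  # model solved it: reveal less next time
print(len(curriculum.revealed_prefix(0, target)))  # 9 of 10 tokens revealed
```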
@article{amani2025rl,
  title = {RL for Reasoning by Adaptively Revealing Rationales},
  author = {Amani, Mohammad Hossein and Lotfi, Aryo and Baldwin, Nicolas Mario and Bengio, Samy and Farajtabar, Mehrdad and Abbe, Emmanuel and West, Robert},
  journal = {arXiv preprint arXiv:2506.18110},
  year = {2025},
}
2024
- Chain-of-Sketch: Enabling Global Visual Reasoning
  Aryo Lotfi*, Enrico Fini*, Samy Bengio, Moin Nabi, and Emmanuel Abbe
  arXiv preprint arXiv:2410.08165, 2024
Modern vision models have achieved remarkable success in benchmarks where local features provide critical information about the target. There is now a growing interest in tackling tasks requiring more global reasoning, where local features do not provide significant information. Minsky and Papert put forward such tasks in 1969 with their connectivity study, exposing the limitations of the perceptron model. In this paper, we introduce an expanded set of global visual datasets involving graphs, strings, mazes, and image grids. We show that large vision models still struggle to learn these tasks efficiently. Similarly, state-of-the-art multi-modal LLMs perform poorly on these datasets. We explain this learning inefficiency by means of the ’globality degree’ measure. To mitigate this, we propose a method called chain-of-sketch (CoS). Similar to the chain-of-thought and scratchpad techniques used in language models, CoS breaks the original task into intermediate visual steps to help learn a complex task. In addition, we show that not all CoS strategies perform equally well. Our key insight is to impose a Markovian structure on the CoS frames. This leads to the introduction of ’inductive CoS’ which achieves better out-of-distribution generalization and performs well even with smaller models compared to non-inductive variants.
@article{lotfi2024chain,
  title = {Chain-of-Sketch: Enabling Global Visual Reasoning},
  author = {Lotfi, Aryo and Fini, Enrico and Bengio, Samy and Nabi, Moin and Abbe, Emmanuel},
  journal = {arXiv preprint arXiv:2410.08165},
  year = {2024},
}
- NeurIPS
  How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad
  Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Colin Sandon, and Omid Saremi
  In Advances in Neural Information Processing Systems, 2024
Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of ’globality degree’ of a target distribution to capture when weak learning is efficiently achievable by regular Transformers. This measure shows a contrast with the expressivity results of Transformers captured by TC0/TC1 classes (further studied here), since the globality relates to correlations with the more limited NC0 class. We show here experimentally and theoretically under additional assumptions that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of ’inductive scratchpad’, that composes the prior information more efficiently, can both break the globality barrier and improve the OOD generalization. In particular, some of our inductive scratchpads can achieve length generalizations of up to 6× for some arithmetic tasks depending on the input formatting.
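The Markovian structure behind the inductive scratchpad can be illustrated on running parity (the formatting below is our own minimal rendering, not the paper's exact tokenization): each scratchpad state is produced from the previous state and the next input token only, so a rule learned on short inputs applies unchanged to longer ones.

```python
# Inductive-scratchpad sketch for running parity: emit intermediate states
# s_i = parity(bits[:i+1]); the final state is the answer. Each update
# depends only on (previous state, next bit) -- a Markovian step.

def inductive_scratchpad(bits):
    """Return the sequence of intermediate parity states."""
    states = []
    state = 0
    for b in bits:
        state ^= b          # next state from (state, b) alone
        states.append(state)
    return states

bits = [1, 0, 1, 1]
print(inductive_scratchpad(bits))  # [1, 1, 0, 1] -> final parity is 1
```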
@inproceedings{abbe2024far,
  author = {Abbe, Emmanuel and Bengio, Samy and Lotfi, Aryo and Sandon, Colin and Saremi, Omid},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {Globerson, A. and Mackey, L. and Belgrave, D. and Fan, A. and Paquet, U. and Tomczak, J. and Zhang, C.},
  pages = {27850--27895},
  publisher = {Curran Associates, Inc.},
  title = {How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad},
  url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/3107e4bdb658c79053d7ef59cbc804dd-Paper-Conference.pdf},
  volume = {37},
  year = {2024},
}
- Generalization on the Unseen, Logic Reasoning and Degree Curriculum
  Emmanuel Abbe, Samy Bengio, Aryo Lotfi, and Kevin Rizk
  Journal of Machine Learning Research, 2024
This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an ’extrapolating’ or ’reasoning’ learner. We study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for sparse functions and a class of network models including instances of Transformers, random features models, and linear networks, a min-degree-interpolator is learned on the unseen. More specifically, this means an interpolator of the training data that has minimal Fourier mass on the higher degree basis elements. These findings lead to two implications: (1) we provide an explanation to the length generalization problem for Boolean functions (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports. Finally, we discuss extensions to other models or non-sparse regimes where the min-degree bias may still occur or fade, as well as how it can be potentially corrected when undesirable.
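The Fourier quantities behind the min-degree-interpolator claim can be computed by brute force for small n (a sketch with illustrative function names, not the paper's code): a Boolean function f on {-1,+1}^n decomposes as f(x) = Σ_S f̂(S) · Π_{i∈S} x_i, and the "degree-d mass" is the squared coefficient mass on sets S with |S| = d.

```python
from itertools import combinations, product

def fourier_coefficients(f, n):
    """fhat(S) = E_x[f(x) * chi_S(x)] under the uniform distribution on {-1,+1}^n."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for d in range(n + 1):
        for S in combinations(range(n), d):
            total = 0
            for x in points:
                chi = 1
                for i in S:
                    chi *= x[i]          # character chi_S(x) = prod_{i in S} x_i
                total += f(x) * chi
            coeffs[S] = total / len(points)
    return coeffs

def degree_mass(coeffs):
    """Total squared Fourier mass per degree |S|."""
    mass = {}
    for S, c in coeffs.items():
        mass[len(S)] = mass.get(len(S), 0.0) + c * c
    return mass

# The 2-parity x0*x1 has all of its mass at degree 2, while majority-of-3
# has mass 3/4 at degree 1 and 1/4 at degree 3.
print(degree_mass(fourier_coefficients(lambda x: x[0] * x[1], 3)))
```

A min-degree interpolator is then one that, among all functions fitting the training data, minimizes the mass on the high-degree components.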
@article{abbe2024GOTU-jmlr,
  author = {Abbe, Emmanuel and Bengio, Samy and Lotfi, Aryo and Rizk, Kevin},
  title = {Generalization on the Unseen, Logic Reasoning and Degree Curriculum},
  journal = {Journal of Machine Learning Research},
  year = {2024},
  volume = {25},
  number = {331},
  pages = {1--58},
  url = {http://jmlr.org/papers/v25/24-0220.html},
}
2023
- ICML
  Generalization on the Unseen, Logic Reasoning and Degree Curriculum
  Emmanuel Abbe, Samy Bengio, Aryo Lotfi, and Kevin Rizk
  In Proceedings of the 40th International Conference on Machine Learning, 2023
This paper received the Outstanding Paper Award at ICML 2023.
This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an ’extrapolating’ or ’reasoning’ learner. We then study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree-interpolator is learned on the unseen. We also provide evidence that other instances with larger learning rates or mean-field networks reach leaky min-degree solutions. These findings lead to two implications: (1) we provide an explanation to the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.
@inproceedings{pmlr-v202-abbe23a,
  title = {Generalization on the Unseen, Logic Reasoning and Degree Curriculum},
  author = {Abbe, Emmanuel and Bengio, Samy and Lotfi, Aryo and Rizk, Kevin},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {31--60},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  url = {https://proceedings.mlr.press/v202/abbe23a.html},
}
- NeurIPS
  Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs
  Emmanuel Abbe, Elisabetta Cornacchia, and Aryo Lotfi
  In Advances in Neural Information Processing Systems, 2023
Experimental results have shown that curriculum learning, i.e., presenting simpler examples before more complex ones, can improve the efficiency of learning. Some recent theoretical results also showed that changing the sampling distribution can help neural networks learn parities, with formal results only for large learning rates and one-step arguments. Here we show a separation result in the number of training steps with standard (bounded) learning rates on a common sample distribution: if the data distribution is a mixture of sparse and dense inputs, there exists a regime in which a 2-layer ReLU neural network trained by a curriculum noisy-GD (or SGD) algorithm that uses sparse examples first, can learn parities of sufficiently large degree, while any fully connected neural network of possibly larger width or depth trained by noisy-GD on the unordered samples cannot learn without additional steps. We also provide experimental results supporting the qualitative separation beyond the specific regime of the theoretical results.
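The sparse-first ordering at the heart of this curriculum can be rendered in a few lines (our own minimal illustration; function names are assumptions): with inputs in {-1,+1}^n, "sparse" examples are those with few -1 coordinates, and the curriculum simply presents them before the dense ones.

```python
# Sparse-first curriculum ordering: sort training inputs by Hamming weight,
# i.e. the number of -1 coordinates (distance from the all-ones input).

def hamming_weight(x):
    """Number of -1 coordinates of a {-1,+1}-valued input."""
    return sum(1 for v in x if v == -1)

def curriculum_order(samples):
    """Present sparse examples first, dense examples later."""
    return sorted(samples, key=hamming_weight)

samples = [(-1, -1, 1), (1, 1, 1), (-1, 1, 1), (-1, -1, -1)]
print(curriculum_order(samples)[0])  # (1, 1, 1): the sparsest example first
```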
@inproceedings{neurips-parity-curriculum,
  author = {Abbe, Emmanuel and Cornacchia, Elisabetta and Lotfi, Aryo},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {Oh, A. and Neumann, T. and Globerson, A. and Saenko, K. and Hardt, M. and Levine, S.},
  pages = {24291--24321},
  publisher = {Curran Associates, Inc.},
  title = {Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs},
  volume = {36},
  year = {2023},
}
2022
- NeurIPS
  Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures
  Emmanuel Abbe, Samy Bengio, Elisabetta Cornacchia, Jon Kleinberg, Aryo Lotfi, Maithra Raghu, and Chiyuan Zhang
  In Advances in Neural Information Processing Systems, 2022
This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a ’reasoning’ function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
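The Boolean influence appearing in the canonical-holdout characterization has a simple brute-force form for small n (a sketch, not the paper's code): Inf_i(f) = Pr_x[f(x) ≠ f(x with coordinate i flipped)].

```python
from itertools import product

def influence(f, n, i):
    """Boolean influence of coordinate i on f over the uniform hypercube {-1,+1}^n."""
    points = list(product([-1, 1], repeat=n))
    flipped = 0
    for x in points:
        y = list(x)
        y[i] = -y[i]                 # flip coordinate i
        if f(x) != f(tuple(y)):
            flipped += 1
    return flipped / len(points)

# The dictator function f(x) = x[0] concentrates all influence on coordinate 0;
# for majority-of-3, each coordinate is pivotal exactly when the other two disagree.
print(influence(lambda x: x[0], 3, 0))  # 1.0
print(influence(lambda x: x[0], 3, 1))  # 0.0
```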
@inproceedings{neurips-boolean-pvr,
  author = {Abbe, Emmanuel and Bengio, Samy and Cornacchia, Elisabetta and Kleinberg, Jon and Lotfi, Aryo and Raghu, Maithra and Zhang, Chiyuan},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {Koyejo, S. and Mohamed, S. and Agarwal, A. and Belgrave, D. and Cho, K. and Oh, A.},
  pages = {2709--2722},
  publisher = {Curran Associates, Inc.},
  title = {Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures},
  volume = {35},
  year = {2022},
}
2021
- Semi-Supervised Disentanglement of Class-Related and Class-Independent Factors in VAE
  Sina Hajimiri, Aryo Lotfi, and Mahdieh Soleymani Baghshah
  arXiv preprint arXiv:2102.00892, 2021
In recent years, extending variational autoencoder’s framework to learn disentangled representations has received much attention. We address this problem by proposing a framework capable of disentangling class-related and class-independent factors of variation in data. Our framework employs an attention mechanism in its latent space in order to improve the process of extracting class-related factors from data. We also deal with the multimodality of data distribution by utilizing mixture models as learnable prior distributions, as well as incorporating the Bhattacharyya coefficient in the objective function to prevent highly overlapping mixtures. Our model’s encoder is further trained in a semi-supervised manner, with a small fraction of labeled data, to improve representations’ interpretability. Experiments show that our framework disentangles class-related and class-independent factors of variation and learns interpretable features. Moreover, we demonstrate our model’s performance with quantitative and qualitative results on various datasets.
@misc{hajimiri2021semisupervised,
  title = {Semi-Supervised Disentanglement of Class-Related and Class-Independent Factors in VAE},
  author = {Hajimiri, Sina and Lotfi, Aryo and Baghshah, Mahdieh Soleymani},
  year = {2021},
  eprint = {2102.00892},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
}