Time Series

General Time-Series Models

In all these works, the code, model and weights are released:

MOMENT (Auton Lab, 2024)
Chronos (AWS, 2024),
MOIRAI-MOE (Silvio Savarese, Salesforce Research, 2024): Time-series foundation model, which uses mixture of expertes to select for different data frequencies. It is build upon MOIRAI (Salesforce Research, 2024).
Timer-XL (Tsinghua University, 2025) implemented other methods as well in here. The older work is Timer (2024) and the newer work Sundial (2025).
TOTO (Datadog, 2025) comes with code and weights
TOTEM (Georgia Gkioxari, Caltech, 2024)
TimesFM (Google Research, 2024)
Lag-Llama (ServiceNow, 2024)
TimeFound (2025)
TTMs (IBM, 2024)

Older works: PatchTST (2022), TimeGPT-1 (Nixtla, 2023), TIME-LLM, LLMTime, AutoTimes, GPT-4TS or FPT (2023).

Autoformer, Informer, Reformer for the long-term forecasting. Some of these methods are provided in HuggingFace Time Series Models. In Transformers Effective for Time Series Forecasting?, argues the transformers are not needed.

Time-Series Representation Learning

TF-C: Time-Frequency Consistency (TF-C) Model (Harvard MIMS Lab, NeurIPS 2022) - A cross-domain time-series representation model that leverages contrastive learning between time-domain and frequency-domain views of the same signal. By enforcing consistent embeddings in both domains, TF-C learns general features transferable to diverse sensors (EEG, accelerometer, etc.). Weights are available.
TS-TCC: Time-Series Representation Learning via Temporal and Contextual Contrasting: contrastive. augmentation: jitter, scale the amplitude and permutation.
TimesURL: Learning Universal Representations of Time Series via Contrastive Learning: contrastive and reconstruction based (MAE) based on TS2Ve model. augmentation: left-righ cropping with frequency swaping of the negative example. hard negatives are temporal and instance based. In temporal way, the augmentation mix in different time. In the instance-based augmentation, this mixing occurs in mixing the different instances.
TS2Vec: Towards universal representation of time series. Augmentation: croping
CPC: Representation Learning with Contrastive Predictive Coding: The contrastive learning works such as SimCLR, MoCo, SwAV, DINO are based on the contrastive loss introduced in this paper. The postive are the predictions.
TNC (Temporal Neighborhood Coding): Triplet loss (neighbor positive and further negative)

Generalized Category Discovery (GCD)

GCD (Generalized Category Discovery) uses supervised contrastive and self-supervised contrastive learning to obtain representations. These representations are then clustered with supervised K-means, enforcing that the labeled samples remain in their original cluster. A Hungarian matching assignment computes clustering accuracy on the labeled data, and this accuracy determines the number of clusters. SimGCD: is a parametric extension of GCD that employs both supervised and unsupervised contrastive losses for representation learning, alongside supervised and unsupervised clustering losses with mean-entropy maximization to shape the clusters. Further SPTNet Builds on SimGCD by introducing Spatial Prompt Tuning—additional prompt parameters learned on image patches—to further improve clustering performance. μGCD Similar to SimGCD, but maintains a teacher model whose weights are updated via exponential moving average (EMA). Note that in both SimGCD and SPTNet, the teacher network is never trained directly, it is simply a detached version of the student model.

GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery outperforms SimGCD in visual similar categories using CLIP text encoder.

SelEx is based on GCD, with improvements coming from the clustering component to ensure balanced clusters during training, as well as from the use of both supervised and unsupervised self-expertise. Self-Expertise assigns different weights to samples depending on whether they are positive or negative, and this weighting is influenced by their class hierarchies (coarse or fine-grained).

Some GCD or NCD (Novel Category Discovery) e.g. UNO methods are based on pioneering works such as SeLa (Asano): self-labeling via simultaneous clustering and representation learning, and SwAV.

Imbalanced Generalized Category Discovery (GCD)

SimGCD: Clustering is trained jointly with the representation learning network. This is not an imbalanced data setting. However, entropy is used as a regularization to prevent the model from over-predicting certain label classes. This issue is more about imbalanced prediction, which arises during joint learning of clustering and representation, but not in separate training as done in the original GCD method. MASA: Multi-Activity Sequence Alignment via Implicit Clustering is related in terms of parametric clustering, though it addresses a different task.
LegoGCD
AGCD
BaCon: Towards Distribution-Agnostic Generalized Category Discovery
Long-Tailed Learning for Generalized Category Discovery
Generalized Category Discovery under the Long-Tailed Distribution
DebiasGCD
Long-tailed GCD
ImbaGCD

Multimodal Time Series

Time-series → text (captioning). TSML introduces a multimodal encoder–decoder that merges a 1-D CNN–based time-series encoder with a positional text‐token stream, and learns this stack end-to-end on an in-context–generated, cross-modally denoised synthetic caption corpus—setting a new state-of-the-art in descriptive accuracy across multiple benchmarks. TADACap, in contrast, requires no gradient updates: it employs a novel diverse‐retrieval strategy to pull the most relevant series–caption pairs from a domain‐specific memory bank and reuses those captions directly—achieving comparable semantic quality with dramatically lower annotation effort and zero fine-tuning. Together, these approaches illustrate the full spectrum—from fully trained specialist decoders to pure retrieval–plus–reuse pipelines—for interpretable time-series narration.

Chat-style time-series assistants. ChatTS treats multivariate time series as a first-class modality by generating attribute-rich synthetic data (via an attribute-based time-series generator and the Time Series Evol-Instruct algorithm) and fine-tuning both a lightweight context-aware time-series encoder and a 14 B LLM on six alignment and four reasoning tasks—yielding a chat interface that can answer questions, detect anomalies, and explain forecasts directly from raw numbers. ChatTime, by contrast, reframes each normalized and discretized numeric value as a new “foreign-language” token, expands a 1 B-parameter LLM’s vocabulary accordingly, and then applies continuous pre-training plus instruction fine-tuning—updating only the added token embeddings and heads (≈ 350 M trainable parameters)—to deliver zero-shot forecasting and seamless bidirectional dialogue between time series and text without touching the core model weights. Together, they span the design space from full LLM fine-tuning for maximal conversational fidelity to parameter-efficient tuning for lightweight, on-device time-series assistants.

Forecasting & reasoning with LLMs in the loop. TimeXL integrates a prototype‐based multimodal encoder with a closed-loop trio of LLM stages—prediction, reflection, and refinement—to produce up to an 8.9 % AUC improvement alongside human-centric, case-based rationales, without requiring full LLM fine-tuning. CAPTime freezes both its pretrained time-series encoder and base LLM, then aligns temporal patterns with exogenous text via learnable interactions and a mixture-of-distribution experts, yielding calibrated, multimodal probabilistic forecasts. SMETimes systematically evaluates sub-3 B-parameter “Small Language Models” using statistical prompting, an adaptive fusion embedding architecture, and a dynamic mixture-of-experts framework to rival 7 B baselines—achieving 3.8× faster training, 5.2× lower memory, and state-of-the-art accuracy. TimeCMA employs dual-branch encoding—weak, disentangled time-series embeddings alongside robust LLM-derived prompt embeddings—and aligns them via cross-modality similarity, passing only the last token to downstream predictors to cut computation, outperforming prior methods on eight datasets.

Only ChatTS, ChatTime, SMETimes weights are released.

Discriminative Representation

The representation that can be used in GCN (Generalized Category Discovery) (GCN, SelEx).

Contrastive learning, Sparse autoencoder or older method such as DEC (Deep Embedded Clustering), SOM (Self Organizing Maps).

Characteristics of Time Series

Implicit Reasoning in Deep Time Series Forecasting: It is observed that certain linear, MLP-based, and patch-based Transformer models generalize effectively in carefully structured out-of-distribution scenarios, suggesting underexplored reasoning capabilities beyond simple pattern memorization.