
Vision Transformers for Time Series Forecasting - Lessons Learned


Abstract: 


Vision Transformers (ViT) for time series forecasting bridge computer vision and temporal data analysis. Traditional time series models often struggle to capture complex, long-range dependencies in high-dimensional, multivariate data. Image-based time series forecasting addresses this limitation by leveraging the power of computer vision techniques to extract rich spatial-temporal features. Vision Transformers, originally designed for image recognition, have emerged as a promising solution due to their ability to model global dependencies and hierarchical representations.


Our research implements a ViT model for stock price forecasting, achieving an R² score of 0.9354 and a Mean Absolute Percentage Error (MAPE) of 5.0010% on Apple Inc. (AAPL) stock data. We also examine key innovations, including time series to image-like conversions, window-based attention mechanisms, and hierarchical representations. The paper addresses challenges such as preserving temporal order in the inherently permutation-invariant self-attention mechanism.


Through comprehensive performance analysis and comparison with state-of-the-art models like Swin4TS, we provide insights into ViT's strengths and limitations in time series forecasting. This work contributes to the growing intersection of computer vision and time series analysis, offering a thorough understanding of ViT's potential in this domain and paving the way for future innovations in financial forecasting and beyond.


Section: Introduction

Recent advances in deep learning have led to the adaptation of Vision Transformers (ViT) for time series forecasting, marking a significant shift in approach to this critical task. Originally designed for image recognition, Vision Transformers have demonstrated remarkable potential in capturing complex temporal patterns and long-range dependencies in time series data, particularly in financial markets and other domains with high-dimensional, multivariate data.

The application of Vision Transformers to time series forecasting stems from the insight that temporal data can be effectively represented as image-like structures. This novel approach allows leveraging the powerful attention mechanisms and hierarchical representations inherent to transformer architectures, which have revolutionized natural language processing and computer vision tasks.


Key innovations in this field include:

1. Data Representation: Converting time series into image-like formats, enabling the application of vision-based deep learning techniques [1].

2. Architectural Adaptations: Modifications to the original ViT architecture to better suit temporal data, such as the Swin4TS model's window-based attention for capturing multi-scale temporal information [2].

3. Efficient Processing: Ability to handle longer input sequences compared to traditional recurrent neural networks, addressing a critical limitation in time series analysis [3].


Our research implements a Vision Transformer model for stock price forecasting, demonstrating its efficacy on real-world financial data. Using Apple Inc. (AAPL) stock as a case study, our model achieved impressive results:

·        R² score of 0.9354, explaining 93.54% of stock price variance

·        Mean Absolute Percentage Error (MAPE) of 5.0010%

·        Root Mean Square Error (RMSE) of $8.2755


These metrics underscore the model's strong predictive capabilities, aligning with and sometimes surpassing the performance reported in recent literature on ViT adaptations for time series [1,2].

However, challenges remain. The permutation-invariant nature of self-attention mechanisms may limit the capture of strict temporal dependencies [3]. Additionally, the complexity of these models raises questions about their efficiency compared to simpler approaches for certain datasets.

This report delves into the current state of Vision Transformers in time series forecasting, presenting:

·        A comprehensive summary of published work in this rapidly evolving field (Section 2)

·        A detailed description of our implementation, including architecture and training methodology (Section 3)

·        A performance analysis comparing our results with other state-of-the-art approaches


By exploring these aspects, we aim to provide a thorough understanding of the potential and limitations of Vision Transformers in time series forecasting, paving the way for future innovations in this exciting intersection of computer vision and time series analysis.


[For more information on the foundational Vision Transformer architecture, refer to the original paper: https://arxiv.org/abs/2010.11929] [6]

 

Section: State of the Art on Vision Transformers for Time Series Forecasting


Some initial context, based on the original Vision Transformer paper (Dosovitskiy et al., 2021) [6]:

1. Original ViT Architecture:

•         Splits an image into fixed-size patches, linearly embeds each patch, adds position embeddings, and feeds the resulting sequence of vectors to a standard Transformer encoder.

•         Uses a [class] token to perform classification, similar to BERT's [CLS] token.

•         Achieves state-of-the-art performance on image recognition benchmarks when pre-trained on large datasets.


2. Scalability and Transfer Learning:

•         ViT shows excellent scalability, outperforming CNNs when trained on sufficient data.

•         Pre-training on large datasets (like JFT-300M) is crucial for ViT's performance, especially for smaller datasets.

3. Efficiency:

•         While ViT can be more computationally efficient than CNNs for pretraining, it may be less efficient for inference on smaller hardware due to the dense attention operations.


1. Adaptation of Vision Transformers for Time Series





• Concept:

o   Convert time series data into image-like representations.

o   Apply Vision Transformer architectures to these representations.


• Key Models:

  - ViTST (Li et al., 2023) [1]:

o   Converts irregularly sampled multivariate time series into line graph images.

o   Processes these images with a Vision Transformer for classification (see the rendering sketch after this list).


  - Swin4TS [2]:

o   Adapts Swin Transformer architecture for time series.

o   Treats time series as 1D sequences analogous to image patches
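
To make the ViTST-style conversion concrete, here is a minimal sketch that renders a multivariate series as a line-graph image suitable for a ViT backbone. The figure layout, resolution, and the helper name series_to_line_image are illustrative assumptions, not the exact procedure of [1].

```python
# Minimal sketch: draw each variable of a multivariate series as a line graph
# and return the rendered figure as an RGB array (an "image" a ViT can consume).
import io
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def series_to_line_image(series: np.ndarray, size=(224, 224)) -> np.ndarray:
    """series: (num_variables, num_timestamps) -> (H, W, 3) uint8 image."""
    fig, axes = plt.subplots(len(series), 1, figsize=(4, 4))
    for ax, variable in zip(np.atleast_1d(axes), series):
        ax.plot(variable, linewidth=1)
        ax.axis("off")                    # keep only the line shapes
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=64)
    plt.close(fig)
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB").resize(size))

img = series_to_line_image(np.random.randn(3, 50))
print(img.shape)  # (224, 224, 3)
```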


• Motivations:

1.      Leverage pre-trained vision models for time series analysis.

2.      Take advantage of hierarchical architectures and efficient attention mechanisms

3.      Enable processing of longer input sequences.

4.      Capture both local and global temporal dependencies.


2. Key Adaptations and Innovations

• Swin4TS Model [2]:

o   Window-based attention for linear computational complexity (see the sketch after this list)

o   Hierarchical representation to capture multi-scale temporal information.

o   Strategies for both channel-dependent and channel-independent multivariate series
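
The following is a minimal PyTorch sketch of window-based self-attention over 1D patch embeddings, illustrating why the cost stays linear in sequence length; the class name WindowAttention1D and all dimensions are illustrative assumptions, not the Swin4TS implementation itself.

```python
# Minimal sketch of window-based self-attention over a 1D sequence of patches.
import torch
import torch.nn as nn

class WindowAttention1D(nn.Module):
    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim); num_patches must be divisible by window_size
        b, n, d = x.shape
        w = self.window_size
        # Group patches into non-overlapping windows: (batch * num_windows, w, dim)
        windows = x.reshape(b * (n // w), w, d)
        # Attention is computed only inside each window, so the cost grows
        # linearly with the number of windows rather than quadratically in n.
        out, _ = self.attn(windows, windows, windows)
        return out.reshape(b, n, d)

# Example: 96 patch embeddings of dimension 64, attended in windows of 8.
x = torch.randn(2, 96, 64)
print(WindowAttention1D(dim=64, num_heads=4, window_size=8)(x).shape)  # (2, 96, 64)
```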


• ViTST Model [1]:

o   Time series to image conversion techniques

o   Integration of temporal information into image representations


3. Experimental Results

• Swin4TS Performance [2]:

o   Outperforms existing baselines on 8 benchmark datasets.

o   Improvements over state-of-the-art:

o   Traffic dataset: 10.3% improvement in MSE (0.397 → 0.356)

o   ILI dataset: 15.8% improvement in MSE (1.967 → 1.657)

o   Strong performance on both short-term (24 steps) and long-term (720 steps) forecasting


• ViTST Performance [1]:

o   Outperforms existing methods on time series classification tasks.

o   On PAM dataset: 7.3% improvement in accuracy, 6.7% in F1 score


• Ablation Studies:

o   Swin4TS [2]: Window-based attention and hierarchical design contribute 3-4% improvement each.

o   ViTST [1]: Positional encoding and multi-head attention crucial for performance


Section: Performance Analysis


Analysis of AAPL stock, predicted vs. actual, with five different approaches. The approaches used in this round were deliberately simple; extensive architectural variations will follow:





·        Simple LSTM (base estimator): A basic LSTM model applied directly to time series data, showing decent fitting but struggling near some troughs and lacking smoothness in predictions. [7]

·        CNN-GADF: Utilizes Convolutional Neural Networks on Gramian Angular Difference Field (GADF) images generated from time series data, but performs poorly due to an inability to capture sequential properties effectively (a minimal GADF sketch follows this list). [7]

·        LSTM-Image: Applies LSTM to sequences of GADF images, showing improved performance over simple LSTM by better capturing temporal patterns in the image representations of time series data. [7]

·        ResNet-LSTM: A two-stage model using ResNet for feature extraction from GADF images, followed by LSTM for forecasting, demonstrating the best performance by effectively combining spatial pattern recognition with temporal dependency modeling. [7]

·        Vision Transformer: Adapts the Vision Transformer architecture for time series forecasting, achieving the highest R² score (0.9354) among all models, indicating strong explanatory power for stock price variance, despite slightly higher error metrics compared to ResNet-LSTM.
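
For reference, here is a minimal NumPy sketch of the GADF encoding used by the CNN-GADF and LSTM-Image approaches above; the min-max rescaling and the example window are illustrative assumptions.

```python
# Minimal sketch of the Gramian Angular Difference Field (GADF) transform.
import numpy as np

def gadf(window: np.ndarray) -> np.ndarray:
    """Encode a 1D window as a GADF image of shape (len(window), len(window))."""
    # Rescale the window to [-1, 1] so each value can be read as cos(phi).
    x = 2 * (window - window.min()) / (window.max() - window.min()) - 1
    x = np.clip(x, -1, 1)
    sin_phi = np.sqrt(1 - x ** 2)
    # GADF[i, j] = sin(phi_i - phi_j) = sin(phi_i)cos(phi_j) - cos(phi_i)sin(phi_j)
    return np.outer(sin_phi, x) - np.outer(x, sin_phi)

prices = np.array([150.0, 152.3, 151.1, 153.8, 155.2, 154.0, 156.5, 157.1])
image = gadf(prices)          # shape (8, 8), values in [-1, 1]
print(image.shape)
```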


a) Best performing models:

·        ResNet-LSTM: Lowest MSE, RMSE, MAE, and MAPE.

·        Vision Transformer: Highest R2 score, indicating the best explanation of variance.

·        Simple LSTM: Second-best performance across most metrics.


b) Poorly performing models:

·        CNN-GADF: High error metrics and negative R2 score.

·        LSTM Image: Highest error metrics and lowest R2 score.


c) Vision Transformer performance:

·        The vision transformer model shows strong performance, with the highest R2 score (0.9354) among all models.

·        It has slightly higher error metrics compared to ResNet-LSTM and Simple LSTM, but still performs well overall.

·        The MAPE of 5.0010% indicates that, on average, its predictions are off by 5% of the actual stock price.


d) Visual inspection:

·        The stock price predictions plot (Image 2) confirms that CNN-GADF performs poorly, with its predictions deviating significantly from the actual prices.

·        Simple LSTM and ResNet-LSTM appear to follow the actual prices most closely.

·        The vision transformer plot (Image 3) shows that its predictions closely follow the actual prices, with some minor deviations.


Performance Effectiveness of Vision Transformers for TS:

The vision transformer approach proves to be highly effective for time series forecasting in this case:

·        High explanatory power: With an R2 score of 0.9354, it explains 93.54% of the variance in stock prices, outperforming all other models in this aspect.

·        Competitive error metrics: While not the lowest, its error metrics are competitive with the best-performing models (ResNet-LSTM and Simple LSTM).

·        Good visual fit: The plot shows that the vision transformer's predictions closely follow the actual stock prices, capturing both trends and some volatility.

·        Balanced performance: It offers a good balance between explanatory power (R2) and prediction accuracy (error metrics), making it a robust choice for this task.

·        Improvement over CNN-based approach: The vision transformer significantly outperforms the CNN-GADF model, suggesting that the transformer architecture is more suitable for capturing temporal dependencies in stock price data.


Performance Summary: While the ResNet-LSTM model shows the lowest error metrics, the vision transformer approach demonstrates excellent overall performance, particularly in explaining stock price variance. Its effectiveness suggests that it's a valuable tool for time series forecasting in financial services, offering a good balance between accuracy and explanatory power.

o   ResNet-LSTM has the best (lowest) MSE, RMSE, MAE, and MAPE.

o   Vision Transformer has the best (highest) R2 score.

o   CNN-GADF and LSTM Image perform poorly compared to the other models.


Section: Code Implementation


This section walks through a Python implementation of a Vision Transformer (ViT) model adapted for time series forecasting, specifically for predicting stock prices.




Instructions:

The git repository is at https://github.com/ArindamBanerji/Time-Series/tree/master/ViT-TS-Forecast. Open the file Simple_Colab_scaffoding_ViT_TS.ipynb from that directory in Google Colab and it will do the rest, including the git pulls, automatically.


Basic Structure


Data Preparation:

·        Stock data is fetched using the yfinance library.

·        Time series data is converted into image-like representations, similar to the approach in ViTST [1].

·        A sequence length of 16 is used, reshaped into a 4x4 grid for each feature.

·        RobustScaler is applied for feature scaling.
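
A minimal sketch of these preparation steps, assuming closing prices as the only feature; the ticker dates and single-channel layout are illustrative assumptions (the original notebook may use additional features).

```python
# Minimal sketch of data preparation: fetch, scale, and window AAPL prices.
import numpy as np
import yfinance as yf
from sklearn.preprocessing import RobustScaler

SEQ_LEN = 16   # each 16-step window is reshaped into a 4x4 grid

# 1. Fetch daily AAPL prices and keep the closing price.
close = yf.download("AAPL", start="2020-01-01", end="2023-12-31")["Close"].to_numpy().reshape(-1, 1)

# 2. Scale robustly (median / IQR), which is less sensitive to price spikes.
scaled = RobustScaler().fit_transform(close).flatten()

# 3. Build (window, next-step target) pairs; each window becomes a 1x4x4 "image".
X, y = [], []
for i in range(len(scaled) - SEQ_LEN):
    X.append(scaled[i:i + SEQ_LEN].reshape(1, 4, 4))
    y.append(scaled[i + SEQ_LEN])
X, y = np.array(X), np.array(y)
print(X.shape, y.shape)   # (n_samples, 1, 4, 4), (n_samples,)
```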


Model Architecture:

·        The core of the model is a Vision Transformer (ViT) from the vit_pytorch library.

·        Key parameters:

·        Image size: 4x4 (derived from sequence length)

·        Patch size: 1.

·        Number of classes: 128

·        Embedding dimension: 256

·        Depth: 6 layers

·        Number of heads: 8

·        MLP dimension: 512

·        A regression head is added on top of the ViT output for price prediction.
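
A minimal sketch of this architecture with the vit_pytorch library; the channel count and the exact regression-head layout are illustrative assumptions.

```python
# Minimal sketch: vit_pytorch ViT backbone plus a small regression head.
import torch
import torch.nn as nn
from vit_pytorch import ViT

class ViTRegressor(nn.Module):
    def __init__(self, channels: int = 1):
        super().__init__()
        self.backbone = ViT(
            image_size=4,      # 16-step window reshaped to a 4x4 grid
            patch_size=1,      # every grid cell becomes one patch
            num_classes=128,   # reused here as a feature vector, not class logits
            dim=256,
            depth=6,
            heads=8,
            mlp_dim=512,
            channels=channels,
        )
        # Regression head mapping the 128-d ViT output to one price value.
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x)).squeeze(-1)

model = ViTRegressor()
print(model(torch.randn(8, 1, 4, 4)).shape)  # torch.Size([8])
```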


Training Process:

·        Implements k-fold cross-validation (5 folds) for robust evaluation.

·        Uses AdamW optimizer with weight decay for regularization.

·        Employs learning rate scheduling (ReduceLROnPlateau) and early stopping.

·        Incorporates data augmentation by adding Gaussian noise to training samples.
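
A minimal sketch of the training loop described above; the learning rate, noise level, and patience values are illustrative assumptions, and ViTRegressor, X, and y refer to the earlier sketches.

```python
# Minimal sketch: 5-fold CV with AdamW, ReduceLROnPlateau, early stopping,
# and Gaussian-noise augmentation of the training windows.
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

def train_cv(X: np.ndarray, y: np.ndarray, n_folds: int = 5, epochs: int = 100, patience: int = 10):
    criterion = nn.MSELoss()
    for fold, (tr_idx, va_idx) in enumerate(KFold(n_splits=n_folds).split(X)):
        model = ViTRegressor()                       # from the architecture sketch above
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=3)
        X_tr = torch.tensor(X[tr_idx], dtype=torch.float32)
        y_tr = torch.tensor(y[tr_idx], dtype=torch.float32)
        X_va = torch.tensor(X[va_idx], dtype=torch.float32)
        y_va = torch.tensor(y[va_idx], dtype=torch.float32)

        best, wait = float("inf"), 0
        for epoch in range(epochs):
            model.train()
            noisy = X_tr + 0.01 * torch.randn_like(X_tr)   # Gaussian-noise augmentation
            opt.zero_grad()
            loss = criterion(model(noisy), y_tr)
            loss.backward()
            opt.step()

            model.eval()
            with torch.no_grad():
                val_loss = criterion(model(X_va), y_va).item()
            sched.step(val_loss)                     # reduce LR when validation stalls
            if val_loss < best:
                best, wait = val_loss, 0
            else:
                wait += 1
                if wait >= patience:                 # early stopping
                    break
        print(f"fold {fold}: best validation MSE = {best:.4f}")
```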


Key Implementation Features:

·        Adaptation of ViT for time series, aligning with recent research trends [1,2].

·        Use of cross-validation and data augmentation to enhance generalization.

·        Implementation of regularization techniques (L2, dropout) to prevent overfitting.


Code Results and Performance:

The model demonstrates strong performance:

·        High R2 score (0.9354) indicates the model explains 93.54% of the variance in stock price.

·        MAPE of 5.0010% suggests predictions are, on average, within 5% of actual values.

·        RMSE of $8.2755 gives the average prediction error in dollars.
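
For reference, these three metrics can be computed with scikit-learn as in the minimal sketch below; y_true and y_pred are assumed to be the actual and predicted prices in dollars.

```python
# Minimal sketch: compute R2, MAPE (%), and RMSE ($) for the test predictions.
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_percentage_error, mean_squared_error

def report(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    r2 = r2_score(y_true, y_pred)                                # share of variance explained
    mape = 100 * mean_absolute_percentage_error(y_true, y_pred)  # average percentage error
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))           # average error in dollars
    print(f"R2={r2:.4f}  MAPE={mape:.4f}%  RMSE=${rmse:.4f}")
```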


Comparison to Literature:

·        The model's performance aligns with the promising results reported for ViT adaptations in time series forecasting [1,2].

·        The high R2 score and low MAPE suggest the model is capturing both long-term trends and short-term fluctuations effectively.

·        The use of a Vision Transformer architecture allows for processing longer input sequences efficiently, addressing a key advantage noted in the literature [2,3].


Code Roadmap:

·        The current implementation doesn't explicitly address the permutation-invariance issue of self-attention, which could be a limitation for capturing temporal dependencies [3].

·        Future work could explore hybrid architectures combining CNNs or RNNs with the Transformer components [1,2].

·        Investigating pre-training strategies specific to financial time series data could potentially improve performance [1,2].

·        Extending the model to multi-step forecasting and testing on a wider range of stocks and market conditions would provide a more comprehensive evaluation.


Section - Challenges and Limitations to ViT in TS Forecasting


• Preserving Temporal Order:

o   Self-attention is inherently permutation-invariant

o   Experiments show some Transformer models are insensitive to input order shuffling [3]

o   On Exchange Rate dataset, FEDformer and Autoformer show <1% performance drop with shuffled inputs
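
A minimal sketch of this shuffling diagnostic applied to the model from the training sketch above; the per-window permutation scheme is an illustrative assumption.

```python
# Minimal sketch: permute the time order inside each input window and measure
# how much validation MSE changes. A near-zero change suggests the model is
# largely ignoring temporal order. `model`, `X_va`, `y_va` are assumed to come
# from the training sketch in the implementation section.
import torch

def shuffle_sensitivity(model, X_va: torch.Tensor, y_va: torch.Tensor) -> float:
    criterion = torch.nn.MSELoss()
    model.eval()
    with torch.no_grad():
        base = criterion(model(X_va), y_va).item()
        flat = X_va.reshape(X_va.shape[0], X_va.shape[1], -1)   # flatten the 4x4 grid
        shuffled = flat[:, :, torch.randperm(flat.shape[-1])].reshape(X_va.shape)
        broken = criterion(model(shuffled), y_va).item()
    return (broken - base) / base   # relative change in MSE after shuffling
```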


• Input Sequence Length:

o   Some Transformer models struggle to leverage longer input sequences [3]

o   Linear models often show consistent improvement with longer inputs, while Transformers plateau or degrade.

o   On Traffic dataset (720-step forecast), Linear model MSE improves from 0.531 to 0.426 as input length increases from 96 to 720.

o   FEDformer's MSE remains around 0.62-0.63 across the same input length range.


• Dataset Size and Complexity:

o   Time series datasets often smaller than typical computer vision or NLP datasets

o   Experiments show reducing training data sometimes improves performance [3]

o   On the Traffic dataset, using 1 year of data instead of the full dataset improved FEDformer's MSE by 2-3%


Section - Future Directions

o   Develop Transformer variants that more explicitly model temporal ordering [2,3]

o   Explore hybrid architectures combining CNNs, RNNs, and Transformer components [1,2]

o   Investigate pre-training strategies specific to time series data [1,2]

o   Apply Vision Transformer concepts to other time series tasks (anomaly detection, imputation) [1]

o   Create more challenging benchmarks that better differentiate model capabilities [3]

 

Section - Conclusions and Hypotheses

o   Vision Transformer adaptations show promise for time series forecasting, particularly for long-term predictions and multivariate series [1,2]

o   The success of these models may be more due to their ability to process long sequences efficiently rather than capturing semantic relationships as in vision tasks [3]

o   Simple models (e.g., linear) can sometimes outperform complex Transformer architectures, suggesting current benchmarks may not fully challenge model capabilities [3]

o   The permutation-invariant nature of self-attention may be fundamentally limiting for capturing temporal dependencies [3]

o   The implemented Vision Transformer model for stock price forecasting shows promising results, aligning with recent research in applying vision transformers to time series data. Its strong performance on AAPL stock data suggests potential for broader application in financial forecasting, while also leaving room for further improvements and extensions based on the latest findings in the field.


Section - Bibliography:

[1] Li, Z., Li, S., & Yan, X. (2023). Time Series as Images: Vision Transformer for Irregularly Sampled Time Series. arXiv preprint arXiv:2303.12799.


[2] [Unnamed authors] (2023). Long-term Time Series Forecasting with Vision Transformer. Paper under review for ICLR 2024.


[3] Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2023). Are Transformers Effective for Time Series Forecasting? In Proceedings of the AAAI conference on artificial intelligence (Vol. 37, No. 9, pp. 11121-11128).


[4] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 12, pp. 11106-11115).


[5] Wu, H., Xu, J., Wang, J., & Long, M. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34, 22419-22430.


[6] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2010.11929


[7] Approaches to Image Data based TS Forecasting: https://github.com/ShubhamG2311/Financial-Time-Series-Forecasting


[8] For Code-bases and data-sets, also see https://github.com/Leezekun/ViTST


Arindam Banerji (banerji.arindam@gmail.com)

 

 
 
 
