Time series data is prevalent in important applications such as robotics, finance, healthcare, and cloud monitoring. In these applications, we typically encounter time series with very high dimensions where the ML task is to perform classification for signal characterization, regression for prediction, or function approximation within a reinforcement learning agent.

Despite the great number of important applications with time series data, the ML literature for this input type is not as plentiful. Additionally, most ML algorithms carry the underlying assumption that samples are independent and identically distributed (iid), which is not the case for most time series. To make it more frustrating, research and examples for *high-dimensional* time series using deep learning are even scarcer. Thus, **this post discusses techniques that can be applied to perform common unsupervised and supervised learning tasks when the data is a high-dimensional time series.** It assumes that you have experience with deep learning and are now looking for solutions to problems involving time series input.

## A Jupyter notebook with an example covering the most important points discussed is included at the end of the post.

Let’s review some fundamentals. Time series modeling has been a subject of study in different research communities: statistics, signal processing, economics, finance, physics, robotics, and online machine learning. Fundamentally, most of the work follows a probabilistic treatment of the problem. In addition, the applied communities have their own techniques that may or may not carry over to using deep learning for *your* problem. *The important aspect is that you must know the probabilistic nature of the time series problem at hand before using deep learning.*

Formally, we define a time series as a series of observations indexed by time,

{ **x**_1, **x**_2, …, **x**_n }, where each **x**_t ∈ ℝ^d.

Here *d* is the dimension (or the number of features) and *t* is the index for each time step.

A high-dimensional input is one where *d* is relatively large. Notice that the number of rows is the number of samples in the time domain and does not indicate high dimension (this means you can have high dimension even with a small dataset). Such high dimension presents a problem for ML because of the curse of dimensionality and also because many of the features can turn out to be noise. Thus, if we don’t reduce the dimension of the original input, we are very likely to overfit.

In the classical paradigm, the time series is modeled as a sequence of random variables {**X**_t} (notice this is different from the raw observations {**x**_t}), with mean and covariance defined as

μ_t = E[**X**_t],  γ(s, t) = Cov(**X**_s, **X**_t) = E[(**X**_s − μ_s)(**X**_t − μ_t)^⊤]
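The empirical counterparts of these quantities are straightforward to compute. A minimal NumPy sketch on synthetic white noise (the series and its size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))  # toy multivariate series: 500 steps, d = 3

# Sample mean vector over time
mu = x.mean(axis=0)

def autocov(x, h):
    """Biased sample estimate of the lag-h autocovariance matrix
    Cov(X_t, X_{t+h}), assuming (weak) stationarity."""
    xc = x - x.mean(axis=0)
    n = len(x)
    return xc[: n - h].T @ xc[h:] / n

gamma0 = autocov(x, 0)  # lag 0: the ordinary covariance matrix
gamma1 = autocov(x, 1)  # for white noise, this should be near zero
```

For white noise, `gamma0` is close to the identity and `gamma1` is close to zero; a persistent series would show large lag-1 entries.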

Models in this form are referred to as stochastic processes. Furthermore, the time series is separated into components: *trend, seasonality, cycle, and residual*. The statistical models of choice are variations of ARMA and ARIMA, for stationary processes and nonstationary processes, respectively.

## Now, an important definition.

Definition: A (weakly) stationary time series is one in which the data generating process does not change over time. It follows that a nonstationary process violates the definition of stationarity and, therefore, has a probability distribution that changes over time.

ARMA and ARIMA models have been widely used for several decades, but they make strong linearity assumptions that might not yield acceptable predictions for complex processes. The important takeaway from this classical statistical treatment is that we need different probabilistic models depending on the data generating process; that is, we must recognize the degree of stationarity or nonstationarity of the input.
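As a crude illustration of the idea (not a formal test; a proper choice would be an augmented Dickey-Fuller test, e.g. statsmodels’ `adfuller`), one can compare summary statistics across the two halves of a series. The tolerance here is a made-up heuristic:

```python
import numpy as np

def rough_stationarity_check(x, tol=0.5):
    """Crude heuristic, NOT a formal test: compare the mean and spread of
    the two halves of a univariate series. Large shifts hint at a
    distribution that changes over time (nonstationarity)."""
    a, b = np.array_split(x, 2)
    mean_shift = abs(a.mean() - b.mean()) / (x.std() + 1e-12)
    spread_shift = abs(np.log((a.std() + 1e-12) / (b.std() + 1e-12)))
    return bool(mean_shift < tol and spread_shift < tol)

rng = np.random.default_rng(1)
stationary = rng.normal(size=1000)              # white noise
trended = stationary + np.linspace(0, 5, 1000)  # same noise plus a trend
```

The white-noise series passes the check, while the trended one fails it, since its mean drifts over time.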

**State-Space and Markov models:** These models were originally developed for control applications and are more powerful than the linear ARMA and ARIMA models. They include the famous Kalman filter and Hidden Markov Models (HMMs). HMMs and their variants are still a great alternative to deep neural networks for some applications. To make predictions **y**, we are after a conditional posterior probability at some horizon *h* away from *t*:

p(**y**_{t+h} | **X**_t, **X**_{t−1}, …)

With most time series, we need some temporal dependence in order to make predictions. In special cases, we can simplify the time series and assume that it satisfies the Markov property. This assumption implies that the state **X**_t is complete, i.e., it encodes all the necessary information to predict the future **X**_{t+1}, and that past and future states are conditionally independent given the present.
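To make the state-space idea concrete, here is a minimal scalar Kalman filter written from scratch in NumPy. It is a sketch: the noise variances `q` and `r` are made-up parameters, and real applications would use a library implementation (e.g. statsmodels) and a multivariate state:

```python
import numpy as np

def kalman_filter_1d(y, a=1.0, q=1e-2, r=1e-1):
    """Minimal scalar Kalman filter for the state-space model
    x_t = a * x_{t-1} + w_t (var q),  y_t = x_t + v_t (var r).
    Returns the filtered estimates E[x_t | y_1, ..., y_t]."""
    x, P = 0.0, 1.0  # initial state mean and variance
    out = []
    for yt in y:
        # predict step
        x, P = a * x, a * a * P + q
        # update step
        K = P / (P + r)            # Kalman gain
        x = x + K * (yt - x)
        P = (1 - K) * P
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(8)
true_x = np.cumsum(0.1 * rng.normal(size=300))     # a slow random walk
y = true_x + np.sqrt(0.1) * rng.normal(size=300)   # noisy observations
xhat = kalman_filter_1d(y, q=0.01, r=0.1)
```

Because the filter exploits the Markov structure, the filtered estimate `xhat` tracks the hidden state much more closely than the raw observations do.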

**Nonstationarity in ML:** Additional difficulties arise in the presence of epistemic uncertainty or nonstationarity in the input. In that case, there is additional signal processing to be done prior to training, and/or the machine learning framework must change from batch to online. I will leave the signal processing out of this post but highlight the assumptions of the batch ML framework. In batch learning, we assume that the training set and the test set come from the same data generating process. Therefore, if you have a stationary process, say a cloud monitoring system, there are PAC learning guarantees. If you have a nonstationary process, say a stock market, there are no PAC learning guarantees, and you cannot expect to make good out-of-sample predictions with a batch ML framework. Unbeknownst to many, batch learning is the default option in ML. Online learning, in contrast, is apt for sequential data because it makes no assumptions about the probability distribution of the input; in fact, the data generating process could even be adversarial. Thus, an online learning framework is the best bet for a nonstationary process, though even then the task is non-trivial (read: regret minimization).

**Sequential dependence in ML:** In feed-forward neural networks, each sample in the training set is considered iid. Therefore, if the input is a time series, the temporal dependence of the data is broken. This problem is solved straightforwardly by using recurrent neural networks in their modern form of LSTM (or GRU) cells. An alternative, less commonly used for time series, is CNNs, which have also been shown to make successful predictions on time series input; after all, images have spatial dependence.

**Transformations:** Now, we must make some transformations to our high-dimensional time-series problem before making predictions with LSTMs or CNNs. First, the size of the high-dimensional input must be reduced. Second, it must be embedded in some acceptable form for the predictor. For the first task, a common misconception is that since CNNs do unsupervised feature extraction for images we can claim the same for time series input. That is not the case if we’re dealing with possibly thousands of additional irrelevant and noisy features. Hence, whether using CNNs or not, it is better to perform dimensionality reduction before training your model. Thus I begin with dimensionality reduction techniques.

The task of dimensionality reduction seeks to project a high-dimensional space onto a lower-dimensional subspace that still explains the data. Geometrically, think of the original input **X** ∈ ℝ^{n × d} as a hypercube of dimension *d*; dimensionality reduction projects it onto a subspace of dimension *k* ≪ *d* that preserves most of the structure of the data.

As explained before, time series have the property of temporal dependence between samples. However, the temporal dependence usually does not matter for dimensionality reduction (there’s some work that attempts to use PCA for time series — see this paper). Hence, we can reduce the dimensionality using the raw 2D input. We have a few options for the dimensionality reduction of time series input. Some common options, sorted by level of difficulty, are:

**SVD, PCA, ICA (unsupervised)**

SVD is a classical matrix algebra approach that is the bread and butter of dimensionality reduction of any matrix. PCA is defined as the eigenvectors of the covariance matrix but can also be viewed as truncated SVD. In PCA, the principal components are a sequence of orthogonal projections of the data (i.e., mutually uncorrelated) and ordered with respect to their variance. In ICA, we find a reduced subspace where the components are statistically independent.
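A quick sketch of PCA as truncated SVD on a toy matrix (the synthetic rank-5 structure and sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# 1000 time steps, 50 features, but only a rank-5 signal plus noise
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 50))

# PCA via truncated SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
Z = Xc @ Vt[:k].T                            # project onto top-k components
explained = (s[:k] ** 2).sum() / (s ** 2).sum()  # fraction of variance kept
```

Here the top 5 components capture nearly all of the variance, so the 50-dimensional input can be replaced by the 5-dimensional projection `Z`.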

The moderate methods are for nonlinear data, on which the basic linear methods do not work well.

**KPCA (nonlinear data)**

Kernelized version of PCA.
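A sketch with scikit-learn’s `KernelPCA` (assuming scikit-learn is available; the RBF kernel, `gamma`, and the noisy-circle data are illustrative choices, since a circle is exactly the kind of nonlinear structure plain PCA misses):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(7)
theta = rng.uniform(0, 2 * np.pi, 300)
# Points on a noisy circle: nonlinear, not well captured by linear PCA
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(300, 2))

# Nonlinear projection via an RBF kernel
Z = KernelPCA(n_components=2, kernel="rbf", gamma=2.0).fit_transform(X)
```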

**Isomaps**

A manifold learning method that reduces the dimension while preserving the geodesic distances between points.

**Autoencoders**

With autoencoders we learn a lower-dimensional feature space by using a deep neural network with an encoder and decoder. Here, we compress the data by forcing a neural network to reproduce the same input after gradually decreasing and increasing the size of the layers.
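A tiny *linear* autoencoder trained with plain gradient descent in NumPy, just to make the compression idea concrete. This is a sketch: real autoencoders stack nonlinear layers in a framework such as Keras, and the sizes, learning rate, and iteration count here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data: 200 samples of a 20-dimensional input that is really rank 5
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 20))
X = X - X.mean(axis=0)

d, k, lr = X.shape[1], 5, 0.1
W_enc = 0.1 * rng.normal(size=(d, k))  # encoder weights (bottleneck k = 5)
W_dec = 0.1 * rng.normal(size=(k, d))  # decoder weights

def loss(W_enc, W_dec):
    """Mean squared reconstruction error."""
    return ((X @ W_enc @ W_dec - X) ** 2).mean()

loss_start = loss(W_enc, W_dec)
for _ in range(2000):
    Z = X @ W_enc                        # encode into the bottleneck
    G = 2.0 * (Z @ W_dec - X) / X.size   # gradient w.r.t. the reconstruction
    grad_dec = Z.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
loss_end = loss(W_enc, W_dec)
```

Training drives the reconstruction error down, which means the 5-dimensional code `Z` retains most of the information in the 20-dimensional input.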

**Variational Autoencoders (VAEs)**

Variational autoencoders are the generative version of autoencoders. An example in Keras for autoencoders and VAEs is here.

I recommend that you employ at least two methods and compare out-of-sample accuracy with the same learner. In the example provided, the dimension is reduced from 2000 to 100.

Feature selection seeks to find a subset of important features from the original space. This process has two implications that differ from dimensionality reduction: (1) the original feature vectors are ranked by some importance measure, and the selected subset remains in the same form as the original data; (2) it generally involves a supervised procedure.

**Shrinkage Methods (supervised)**

Shrinkage methods reduce the number of features by imposing weight penalties that shrink the less important ones toward zero. The most common are Lasso and Ridge regression. These penalties are implicitly included in a neural network whose loss objective has L1 or L2 regularization.
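Since Lasso is just L1-penalized least squares, a minimal from-scratch sketch can be written with ISTA (iterative soft-thresholding); in practice you would use a library implementation such as scikit-learn’s `Lasso`. The synthetic data and the penalty `lam` are made up for illustration:

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, iters=500):
    """Lasso via ISTA: minimizes (1/2n)||y - Xw||^2 + lam * ||w||_1.
    The L1 penalty drives unimportant weights exactly to zero."""
    n, d = X.shape
    lr = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.5, 1.0]             # only 3 of 10 features matter
y = X @ true_w + 0.1 * rng.normal(size=200)
w = lasso_ista(X, y, lam=0.1)
```

The recovered `w` keeps the three relevant features and zeroes out the rest, which is exactly the selection behavior described above.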

**Tree Methods (supervised)**

The tree methods, such as Random Forests, use entropy (impurity) measures to make the splits for classification or regression. These splits can, in turn, be used to rank the features and select a subset.
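A sketch using scikit-learn’s `RandomForestRegressor` (assuming scikit-learn is available; the synthetic data is constructed so that only the first two features matter):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 8))
# Only features 0 and 1 carry signal; the rest are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Impurity-based importances, sorted most important first
ranking = np.argsort(forest.feature_importances_)[::-1]
```

The ranking places the two informative features at the top, so the remaining ones can be dropped.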

In this section, we discuss how to embed or encode *each* sequence *into an appropriate representation for the predictor*. **To visualize this, think of your multivariate time series sequence Q as an image that needs to be classified.** If the time series input doesn’t consist of real-valued vectors at each time step (e.g., text), each time step must first be embedded into some real-valued representation (for example, using Word2Vec); in this section, we’re referring to the embedding of the *time series sequence* itself. For what follows, we assume that each time step is already in a real-valued representation.

LSTMs work better if we are in a regression setting, where we are trying to predict future time steps, or as a function approximator for an RL agent. If we’re using an LSTM as the predictor, we don’t need to *embed* the input per se but we do need to reshape it into a 3D tensor. This is simple.

**Input shape and sequence length:** In order to consider the sequential nature of the input matrix **X**, it needs to be reshaped into a 3D tensor as

**X** ∈ ℝ^{n × t × d}

where *n* is the number of sequences, *t* is the number of time steps per sequence, and *d* is the dimension of the sequence.

Thus, we need to determine a priori the length of the sequence (number of time steps) that are important for a prediction either through domain expertise, trial-and-error, or signal processing. The length of the sequences can be fixed or variable.
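A minimal sketch of the reshaping step with sliding windows (the window length `t = 4` here is arbitrary; in practice it comes from domain expertise or tuning, as noted above):

```python
import numpy as np

def make_sequences(X, t):
    """Slice a (time, d) matrix into overlapping windows of length t,
    producing the (n, t, d) tensor that LSTM layers expect."""
    n = X.shape[0] - t + 1
    return np.stack([X[i:i + t] for i in range(n)])

X = np.arange(20.0).reshape(10, 2)  # 10 time steps, d = 2
seqs = make_sequences(X, t=4)       # shape (7, 4, 2)
```

Each of the 7 resulting sequences preserves 4 consecutive time steps, so the temporal dependence within a window is kept intact.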

This embedding typically either: (a) transforms the time series to the frequency domain and then embeds it into an image or (b) uses recurrence plots. The resulting image can later be classified with a CNN-based architecture. If done correctly, I find (b) to be a better technique for classification (or characterization) of a signal with DL.

Most of the literature on this subject embeds a *univariate* time series as an image (a good paper on embedding univariate time series into images is here). The trouble is transforming a *multivariate* time series into an equivalent image. A few ideas come to mind. Again, we can take a frequency-domain approach, convert the multivariate time-domain input into a spectrogram, and embed that as an image. However, we lose information when converting to the frequency domain. Therefore, a preferred method is to convert the multivariate input into a 2D recurrence plot by taking the entry-wise product of the univariate recurrence plots. This is done by taking each sequence of the reshaped 3D tensor and converting each 2D sequence into a recurrence plot or spectrogram.
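A sketch of the entry-wise product idea above, using a simple thresholded recurrence plot (the threshold `eps` is a made-up parameter that needs tuning in practice, and the random data is only for shape illustration):

```python
import numpy as np

def recurrence_plot(x, eps):
    """Binary recurrence plot of a univariate series:
    R[i, j] = 1 when |x_i - x_j| < eps."""
    D = np.abs(x[:, None] - x[None, :])
    return (D < eps).astype(float)

def multivariate_recurrence_plot(X, eps):
    """Entry-wise (Hadamard) product of the per-feature recurrence plots:
    a joint recurrence image across all d features of one sequence."""
    R = np.ones((X.shape[0], X.shape[0]))
    for j in range(X.shape[1]):
        R *= recurrence_plot(X[:, j], eps)
    return R

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))          # one sequence: 50 steps, d = 3
R = multivariate_recurrence_plot(X, eps=0.5)
```

The resulting 50 × 50 image `R` is symmetric with ones on the diagonal, and it is this image (one per sequence) that would be fed to the CNN.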

In some applications, such as botnet detection, the input consists of a graph that changes over time. In general, when you have relationships in your time series input that can be encoded into a graph, you can do a dynamic graph embedding. However, the dynamic graph embedding has to be compatible with a neural network, since we cannot simply use a graph as input to a neural network. Thankfully, there are a few recent frameworks that allow for this; one of the best is Structure2Vec, which allows for end-to-end training. For an information-theoretic approach, see Directed Information Graphs.

The first step is to decide whether you need DL at all. If your time series problem is relatively simple (i.e., not high-dimensional, stationary, linear), then ARMA or ARIMA will probably yield good predictions (see Facebook’s Prophet). If it’s not, you’re in luck: we now have function approximators (DL!) that can be used for time series input. Keep in mind that, as discussed, time series input is dependent on several previous time steps. Again, the first task is to reduce the dimensionality of the input; then we perform an embedding. After that, we’re ready to use a DL model to make predictions.

LSTM models are the most popular to model time series input. This is because LSTMs are designed to *remember* long-term dependencies and have shown great success in multiple applications, most importantly in NLP. They are the preferred model for regression and also work well for classification. By this point, you know that for a high-dimensional time series, we first reduce its dimensionality and then for LSTMs we reshape the input into a 3D tensor.

CNN models are the go-to machine for image classification. They have been shown to be good predictors for time series input when the sequences are embedded as images. If we want to classify each sequence instead of making predictions into the future, then CNNs are a great option. After reducing the dimensionality of the raw input, we embed it as an image or a compatible graph. The thing you want to avoid is taking raw time series input and then performing 2D convolutions.

## Word of caution: applying 2D convolutions to raw time series input does not make sense.

We saw that high-dimensional time series are unlike the other input types commonly found in the deep learning literature. In order to use deep learning models for regression or classification, the first step is to use an appropriate technique to reduce the high dimension of the input. The second step is to use an appropriate embedding of the sequences so that the LSTM or CNN can handle the time dependencies. Finally, we must know the nature of the data generating process: if the process is nonstationary, batch ML will not generalize well, and the online ML approach is a more feasible option. Using the online ML paradigm for deep neural networks is the subject of another post.