Nonlinear Dimensionality Reduction with UMAP

Links to the materials: report | GitHub repository.

This is my course project for MA4270 Data Modelling and Computation, where I got to explore manifold learning (also known as nonlinear dimensionality reduction) for the first time.

The method I focused on in this project was UMAP and its parametric version. Apart from reading these papers, I learned most of the details and intuitions behind the approach by going through the official implementations and the official documentation. My report adds some material to the explanation of the attractive and repulsive forces, e.g. how they behave as the probabilistic weight \(p_{ij}\) changes.
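To make the force picture concrete, here is a minimal sketch (not taken from my report or the official code) of the per-edge force magnitudes one obtains by differentiating UMAP's cross-entropy loss with respect to the embedding distance, assuming the usual low-dimensional similarity \(q = 1/(1 + a d^{2b})\). The default values of \(a\) and \(b\) below are rough assumptions, not the exact curve-fitted values UMAP computes from `min_dist`:

```python
def umap_forces(d, p, a=1.577, b=0.895):
    """Attractive and repulsive force magnitudes on a single edge,
    from the gradient of the cross-entropy loss
        -p * log(q) - (1 - p) * log(1 - q),
    where q = 1 / (1 + a * d**(2b)) is the low-dimensional similarity.
    d: distance between the two embedded points
    p: probabilistic weight p_ij of the edge in the high-dimensional graph
    a, b: assumed curve parameters (UMAP fits these from min_dist)."""
    q = 1.0 / (1.0 + a * d ** (2 * b))
    # |d/dd of -p*log(q)|: attraction scales linearly with p
    attract = p * 2 * a * b * d ** (2 * b - 1) * q
    # |d/dd of -(1-p)*log(1-q)|: repulsion scales with (1 - p)
    repel = (1 - p) * 2 * b / (d * (1 + a * d ** (2 * b)))
    return attract, repel

# At a fixed distance, attraction dominates for high-weight edges
# and repulsion dominates for low-weight ones.
for p in (0.1, 0.5, 0.9):
    att, rep = umap_forces(d=1.0, p=p)
    print(f"p_ij = {p}: attract = {att:.3f}, repel = {rep:.3f}")
```

The takeaway matches the discussion above: as \(p_{ij} \to 1\) the attractive term dominates, and as \(p_{ij} \to 0\) the repulsive term does.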

In addition, I explored using probabilistic PCA on the nearest neighbors to approximate the tangent space (and hence estimate the dimension of the underlying manifold); it worked as expected (see the report if you are interested). This is based on the simple insight that the tangent space of an \(n\)-dimensional manifold is an \(n\)-dimensional vector space: let \(\phi: U_\alpha \to \mathbb{R}^n\) be a chart of the manifold \(M\); then \(\phi\) induces an isomorphism \(\phi^\ast: \mathcal{E}_{\mathbb{R}^n}(V') \to \mathcal{E}_M(\phi^{-1}(V'))\) such that \((\phi^\ast f)(x)=f(\phi(x))\) for \(x\in\phi^{-1}(V')\) and \(f \in \mathcal{E}_{\mathbb{R}^n}(V')\), where \(V'\) is an open subset of \(\phi(U_\alpha)\). Let \(p \in U_\alpha\); taking the direct limit, we obtain an isomorphism \(\phi_p^\ast: \mathcal{E}_{\mathbb{R}^n,\phi(p)}\to\mathcal{E}_{M, p}\) between germs. Define \(d\phi_p: T_p(M) \to T_{\phi(p)}(\mathbb{R}^n)\) by \(d\phi_p(D) = D\circ \phi_p^\ast \in T_{\phi(p)}(\mathbb{R}^n)\) for all \(D \in T_p(M)\), i.e. the following diagram commutes:

Commutative Diagram

Consequently, the differential \(d\phi_p: T_p(M) \to T_{\phi(p)}(\mathbb{R}^n)\) is a linear isomorphism between the tangent spaces. It is known that

$$T_{\phi(p)}(\mathbb{R}^n) = \text{span}\left(\left\{\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_n}\right\}\right)$$

is an \(n\)-dimensional vector space, hence the tangent space \(T_p(M)\) of the \(n\)-dimensional manifold \(M\) at point \(p\) is also \(n\)-dimensional.
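The practical upshot of this argument can be sketched in a few lines: since a small neighborhood of a point is well approximated by its \(n\)-dimensional tangent space, running PCA on each point's nearest-neighbor patch and counting the dominant components estimates \(n\). The sketch below uses plain PCA as a simpler stand-in for the probabilistic PCA in my report, and the function name, `k`, and the variance threshold are illustrative choices, not anything from the report:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def estimate_intrinsic_dim(X, k=20, var_threshold=0.9):
    """Estimate the manifold dimension of a point cloud X by fitting
    PCA on each point's k-nearest-neighbor patch (a local tangent-space
    approximation), counting the components needed to reach
    var_threshold of the explained variance, and taking the median."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    dims = []
    for neighborhood in idx:
        patch = X[neighborhood]
        pca = PCA().fit(patch - patch.mean(axis=0))
        cum = np.cumsum(pca.explained_variance_ratio_)
        dims.append(int(np.searchsorted(cum, var_threshold) + 1))
    return int(np.median(dims))

# A rolled-up 2-D sheet embedded in R^3 (Swiss-roll-like surface):
# local patches are nearly flat, so local PCA should report 2.
rng = np.random.default_rng(0)
t = rng.uniform(0, 4 * np.pi, 2000)
h = rng.uniform(0, 10, 2000)
X = np.c_[t * np.cos(t), h, t * np.sin(t)]
print(estimate_intrinsic_dim(X))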

Note. I found this topic interesting since the geometry of non-Euclidean parameter spaces often arises in robotics (e.g. this paper). In addition, I was wondering whether I could borrow the force-directed graph drawing paradigm to “stretch and contract” the input data, making it more suitable for Gaussian processes with stationary kernels, similar to a paper on input warping through the Beta CDF. The Beta CDF approach does not seem promising enough to me at first glance, since it only “stretches and contracts” in an axis-aligned manner, so I was looking for alternative methods. However, it also seems somewhat non-trivial to make UMAP work in this scenario, and perhaps deep kernel learning is already sufficient (and possibly more powerful) for the same purpose.