data_mining

**Keywords:** clustering, kernel methods, linear algebra, parallelization

**Clustering** aims at partitioning a p-dimensional data set into K compact and homogeneous clusters, with respect to some “similarity measure” between the elements of the data set. Clustering has applications in a wide variety of fields: biology, pattern recognition, image segmentation, geometric modeling…
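As a concrete illustration of this partitioning task (a generic sketch, not one of the specific methods discussed below), the classical K-means iteration alternates nearest-center assignment and center updates; all names and data here are illustrative:

```python
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    """Minimal K-means: partition the rows of X (n points in p dimensions)
    into K clusters by alternating assignment and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # Move each center to the mean of its assigned points.
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels

# Two well-separated groups in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(20, 1, (20, 2))])
labels = kmeans(X, K=2)
```

Note that K-means, based on distances to centers, favors convex, blob-like clusters; this limitation is exactly what the kernel methods below address.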

Among recent partitioning algorithms, **kernel-based methods**, such as spectral clustering, offer the ability to recover clusters of arbitrary shape.
**Spectral clustering** consists in creating, from the spectral elements of a Gaussian affinity matrix, a low-dimensional space in which the data are grouped into clusters.
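A compact sketch of that pipeline, assuming the common normalized-affinity variant (the exact normalization used in the work summarized here is not specified in the text):

```python
import numpy as np

def spectral_clustering(X, K, sigma=1.0, seed=0):
    """Sketch: Gaussian affinity -> spectral embedding -> clustering.
    The normalization D^{-1/2} A D^{-1/2} is an assumption of this sketch."""
    # Gaussian affinity A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero diagonal.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))          # normalized affinity
    # Low-dimensional space: the K leading eigenvectors, rows renormalized.
    _, V = np.linalg.eigh(L)
    U = V[:, -K:]
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    # Group the embedded points with a basic K-means.
    rng = np.random.default_rng(seed)
    C = U[rng.choice(len(U), size=K, replace=False)]
    for _ in range(50):
        labels = ((U[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        C = np.array([U[labels == k].mean(axis=0) for k in range(K)])
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(6, 0.5, (15, 2))])
labels = spectral_clustering(X, K=2, sigma=1.0)
```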
This unsupervised method relies mainly on the Gaussian affinity measure, its parameter, and its spectral elements. However, questions about the separability of clusters in the projection space and about the choice of the Gaussian parameter remained open. The role of the Gaussian affinity parameter was first investigated through quality measures, and two heuristics for choosing this setting were defined.
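Those two heuristics are not reproduced here; as an illustration of how such a setting can be derived from the data, a widely used alternative (the median pairwise-distance heuristic, an assumption of this sketch, not the heuristics defined in the work above) reads:

```python
import numpy as np

def median_sigma(X):
    """Illustrative Gaussian-parameter heuristic: the median of all pairwise
    Euclidean distances in the data set (not the heuristics of the text)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dists = np.sqrt(sq[np.triu_indices(len(X), k=1)])
    return np.median(dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
sigma = median_sigma(X)
```

A desirable property of any such heuristic is scale equivariance: rescaling the data rescales the parameter accordingly.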
The method was then studied through the spectral elements of the Gaussian affinity matrix. By interpreting this matrix as a discretization of the heat kernel defined on the whole space and using finite elements, the eigenvectors of the affinity matrix are shown to be asymptotic representations of functions whose support is included in a single connected component. These results helped define clustering properties and conditions on the Gaussian parameter. As applications, Kinetic Spectral Clustering was applied to **segmenting dynamic PET images**, in collaboration with INSERM, the University of Tours, and Universities Paris-7 and Paris-11; in geometric modeling, kernel methods were used for **detecting similarities on parametric surfaces**, in collaboration with the IRIT-VORTEX team.
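The support property can be checked on a toy case: for an affinity matrix that is exactly block diagonal (two components with no cross affinity), each leading eigenvector is supported on a single block. This numerical check is illustrative only; the result in the text is the asymptotic, heat-kernel version.

```python
import numpy as np

# Two disconnected "components": a block-diagonal affinity matrix.
A = np.zeros((6, 6))
A[:3, :3] = 0.9   # affinities inside component 1
A[3:, 3:] = 0.8   # affinities inside component 2
np.fill_diagonal(A, 1.0)

w, V = np.linalg.eigh(A)
v1, v2 = V[:, -1], V[:, -2]   # the two leading eigenvectors
# v1 is supported on component 1 only, v2 on component 2 only.
```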

With emerging technologies in various applications, very large amounts of input data must be handled and, to be efficient, clustering methods generally require **high performance computing**
(including fine-tuned kernels for linear algebra and numerical optimization).
Parallelization strategies based on **decomposition into sub-domains** were therefore formulated and tested on geometrical and realistic examples.
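A minimal sketch of the idea (assumed scheme: split the point set along one coordinate, cluster each sub-domain concurrently, and offset the labels; the merging of clusters across sub-domain boundaries, which the actual strategies must handle, is omitted here):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def kmeans(X, K, seed=0):
    """Basic K-means used independently inside each sub-domain."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(50):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        C = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels

def cluster_by_subdomains(X, K_per_domain):
    # Decompose the domain: split along the first coordinate at the median.
    cut = np.median(X[:, 0])
    parts = [np.flatnonzero(X[:, 0] <= cut), np.flatnonzero(X[:, 0] > cut)]
    # Cluster the two sub-domains concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        sub = list(pool.map(lambda idx: kmeans(X[idx], K_per_domain), parts))
    # Combine: give each sub-domain its own label range (no boundary merge).
    labels = np.empty(len(X), dtype=int)
    for d, (idx, lab) in enumerate(zip(parts, sub)):
        labels[idx] = lab + d * K_per_domain
    return labels

rng = np.random.default_rng(2)
blobs = [rng.normal(c, 0.5, (10, 2)) for c in ((0, 0), (0, 5), (10, 0), (10, 5))]
X = np.vstack(blobs)
labels = cluster_by_subdomains(X, K_per_domain=2)
```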

Some clustering methods can be viewed as structural, and **sparse linear algebra** is at the heart of innovative clustering methods for treating large-scale simulations on high-performance computers. **Block structure identification** in sparse matrices provides direct information on the clusters in the data.
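A toy illustration of that last point (a dense 0/1 pattern is used for brevity; a real implementation would traverse the sparse storage directly): finding the connected components of the sparsity graph identifies the diagonal blocks, and each block indexes one cluster of the data.

```python
import numpy as np

def block_labels(A):
    """Identify the diagonal blocks (connected components of the sparsity
    pattern) of a symmetric 0/1 matrix A; each block maps to one cluster."""
    n = len(A)
    labels = np.full(n, -1, dtype=int)
    n_blocks = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        stack = [start]                  # depth-first traversal of the pattern
        while stack:
            i = stack.pop()
            if labels[i] >= 0:
                continue
            labels[i] = n_blocks
            stack.extend(np.flatnonzero(A[i]).tolist())
        n_blocks += 1
    return n_blocks, labels

# A 5x5 sparse pattern with two blocks: {0, 1, 2} and {3, 4}.
A = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1]])
n_blocks, labels = block_labels(A)
```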

data_mining.txt · Last modified: 2015/03/06 09:46 (external edit)