RBFMINE: Distributed Data Mining with Radial Basis Functions on the Computational Grid

**Investigators:** George Roussos (SCSIS, BBK), Brad Baxter (Economics, BBK)

**Sponsor:** The Nuffield Foundation

**Classification:** distributed computing, distributed data mining, radial basis functions.

Recently there has been considerable interest in employing radial basis functions (RBFs) for data mining. The main reason for this interest is that radial basis functions do not suffer from the "curse of dimensionality": the number of sample points required to achieve a sufficiently accurate approximation of a target function grows only linearly with the number of dimensions (independent variables). This property represents a significant advantage over more traditional regression methods.

On the other hand, until recently the Global RBF method has been associated with very high computational cost, both for computing the RBF coefficients and for evaluating the resulting approximant at a particular point. However, during the past five years significant advances have been made towards rapid computation and evaluation schemes. Today the computational complexity of RBF approximation has been reduced to linear dependence on the number of samples as well as on the number of evaluation points (compare this with the cubic cost of computation and the quadratic cost of evaluation of the naïve approach).
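For reference, the naïve approach mentioned above can be sketched in a few lines. This is a minimal 1-D illustration, not the project's implementation: the multiquadric basis, the shape parameter, and the sine target are choices made for the example. Fitting requires solving a dense N×N linear system (cubic cost), and each evaluation sums over all N samples.

```python
import math

def phi(r, c=1.0):
    # Multiquadric radial basis function, a common choice of global RBF
    # (the shape parameter c = 1.0 is an illustrative value)
    return math.sqrt(r * r + c * c)

def solve(A, b):
    # Naive Gaussian elimination with partial pivoting: O(N^3) work
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rbf_fit(points, values):
    # Interpolation matrix A[i][j] = phi(|x_i - x_j|); dense, hence O(N^3) to solve
    A = [[phi(abs(xi - xj)) for xj in points] for xi in points]
    return solve(A, values)

def rbf_eval(coeffs, points, x):
    # One evaluation costs O(N); M evaluation points cost O(N * M)
    return sum(c * phi(abs(x - xj)) for c, xj in zip(coeffs, points))

# Fit a tiny interpolant to samples of sin(x)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
fs = [math.sin(x) for x in xs]
cs = rbf_fit(xs, fs)

# The interpolant reproduces the sample values at the data sites
err = max(abs(rbf_eval(cs, xs, x) - math.sin(x)) for x in xs)
```

The rapid schemes discussed above replace the dense solve and the all-pairs evaluation sum with hierarchical approximations, bringing both costs down to linear in N.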

Currently, whenever data mining software employs RBFs for their superior dimensionality properties, it does so either at high computational cost (which effectively restricts the method to relatively small data sets) or by using Locally Supported RBFs, which arguably offer significantly lower accuracy. Indeed, compared with their Global counterparts, Locally Supported RBFs compute approximations that take into account only the sample points in the neighbourhood of their centres, and thus miss global features of the data set.
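To make the locality concrete, here is a sketch of a typical compactly supported basis function, the Wendland C² function (a standard textbook example; the support radius and sample sites are illustrative). Beyond the support radius the function is identically zero, so each centre only "sees" its neighbourhood and the interpolation matrix becomes sparse:

```python
def wendland_c2(r, delta=1.0):
    # Wendland's C^2 compactly supported RBF: identically zero for r >= delta,
    # so distant samples contribute nothing (delta = 1.0 is an illustrative
    # support radius, not a tuned value)
    t = r / delta
    return (1.0 - t) ** 4 * (4.0 * t + 1.0) if t < 1.0 else 0.0

# Scattered 1-D sample sites: note the gap between the two clusters
points = [0.0, 0.3, 1.2, 2.5, 2.7]
A = [[wendland_c2(abs(xi - xj)) for xj in points] for xi in points]

# Most entries vanish: samples in one cluster never interact with the other,
# which is exactly how global features of the data set get lost
nonzeros = sum(1 for row in A for v in row if v != 0.0)
print(nonzeros, "of", len(points) ** 2)  # 11 of 25
```

The sparsity is what makes Locally Supported RBFs cheap; the price is that no basis function couples the two clusters above, whereas a Global RBF (e.g. the multiquadric) couples every pair of samples.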

In this project we propose to investigate the following issues:

- A comparative study of rapid computation methods for Global RBFs. Although a plethora of alternative algorithms has been proposed in the literature, there are no comparative studies of how the different methods perform under different conditions (number of data points, distribution of data points, sensitivity to errors in the sample data), especially for approximations on higher-dimensional manifolds, which are of particular interest in data mining applications.
- A comparative study of rapid evaluation methods for Global RBFs. The issues raised in the previous item apply equally to the various rapid evaluation methods.
- A study of Global RBFs for distributed data mining. A common characteristic of most rapid computation and evaluation algorithms is their hierarchical nature. This property makes it possible to build local models from the sample data independently -- and indeed in parallel -- with each sub-problem handled without using samples from any other sub-problem; global features then result from combining the local solutions at a higher level. This is particularly important on the Computational Grid, where model building is a collaborative process. In fact, this approach makes it feasible to build collaborative models without granting explicit access to proprietary and/or confidential data, exposing only higher-level, aggregated RBF-based abstractions -- effectively a process-based approach to creating a Virtual Organisation through collaborative knowledge extraction.
The investigation proposed here is expected to produce a Grid-enabled software toolkit for distributed data mining using Global RBFs. This toolkit will implement all competitive algorithmic alternatives and thus form the basis for the comparative study of the different methods. The toolkit will be accompanied by a collection of data sets for benchmarking, offering diversity in the number of data points, the distribution of data points and, last but not least, the sensitivity to errors in the sample data.
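One heavily simplified way to picture the collaborative scheme described above: each site fits a Global RBF model to its own samples and publishes only its centres and coefficients, never the raw data; a coordinator then blends the published local models. Everything here is an assumption made for the sketch -- the 1-D setting, the multiquadric basis, the exponential target, and in particular the Shepard-style inverse-distance blending rule, which stands in for the hierarchical combination the project would actually study.

```python
import math

def phi(r, c=1.0):
    # Multiquadric RBF used for each site's local model (illustrative choice)
    return math.sqrt(r * r + c * c)

def solve(A, b):
    # Gaussian elimination with partial pivoting (fine for tiny local systems)
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_local(xs, fs):
    # Each site solves its own small RBF system; raw samples never leave the site,
    # only this aggregated abstraction (centres + coefficients) is published
    A = [[phi(abs(a - b)) for b in xs] for a in xs]
    return {"centres": xs, "coeffs": solve(A, fs)}

def eval_local(model, x):
    return sum(c * phi(abs(x - t)) for c, t in zip(model["coeffs"], model["centres"]))

def eval_combined(models, x):
    # Shepard-style blend: weight each local model by inverse squared distance
    # to the mean of its centres (an illustrative combination rule)
    ws, vals = [], []
    for m in models:
        centre = sum(m["centres"]) / len(m["centres"])
        ws.append(1.0 / (1e-9 + (x - centre) ** 2))
        vals.append(eval_local(m, x))
    return sum(w * v for w, v in zip(ws, vals)) / sum(ws)

target = math.exp  # a target function both sites sample in their own regions
site1 = fit_local([0.0, 0.4, 0.8], [target(x) for x in [0.0, 0.4, 0.8]])
site2 = fit_local([1.2, 1.6, 2.0], [target(x) for x in [1.2, 1.6, 2.0]])

# The coordinator only ever sees the two published models, not the samples
approx = eval_combined([site1, site2], 1.0)
```

Because the two sub-problems are solved independently, they can run in parallel on separate Grid nodes, and the combination step operates purely on the higher-level RBF abstractions -- the confidentiality property described above.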