Data Mining is nontrivial extraction of implicit, previously unknown and potential useful information from the databases. For a database with number of records and for a set of classes such that each record belongs to one of the given classes, the problem of classification is to decide the class to which the given record belongs. The classification problem is also to generate a model for each class from given data set. We are going to make use of supervised classification in which we have training dataset of record, and for each record the class to which it belongs is known. There are many approaches to supervised classification. Decision tree is attractive in data mining environment as they represent rules. Rules can readily expressed in natural languages and they can be even mapped through database access languages.
Now a day’s classification based on decision trees is one of the important problems in data mining which has applications in many areas. Database system has become highly distributed, and we are using many paradigms. We consider the problem of inducing decision tree in a large distributed network of highly distributed databases. The classification based on decision tree can be done on the existence of distributed databases in healthcare and in bioinformatics, human computer interaction and by the view that these databases are soon to contain large amounts of data, characterized by its high dimensionality. Current decision tree algorithms would require high communication bandwidth, memory, and they are less efficient and scalability reduces when executed on such large volume of data. So there are some approaches being developed to improve the scalability and even approaches to analyse the data distributed over a network.
A key challenge for data mining is tackling the problem of mining richly structured datasets, that is distributed and links between objects in some way. Links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records. Data on the web is huge and distributed across various sites. The traditional approach is to integrate all data into one site and perform required analysis. The problem with this is its time consuming and not scalable, so we need to find more efficient algorithms to mine data that is distributed over the network.
|