
Data Mining


Database applications in scientific, business, and other areas are increasing with advances in information technology. The largest business database in the world, operated by Walmart, handles over 20 million transactions each day. Another huge database, used in oil exploration by Mobil Oil Corporation, stores more than 100 terabytes of data. Further, NASA's Earth Observing System of orbiting satellites and other spaceborne instruments generates 50 gigabytes of remotely sensed image data per hour. It has been estimated that the amount of information in the world doubles every 20 months, and the size and number of databases are increasing even faster.

A Database Management System (DBMS) provides access to the stored data, but this is only a small part of what could be gained from the data. On-Line Transaction Processing (OLTP) is good at placing data into databases quickly, but not at providing meaningful analysis. The knowledge gained through data analysis goes a step beyond the information stored in a database. It is therefore imperative to know how to extract knowledge from stored data, which provides the basis for data mining, or knowledge discovery in databases. Knowledge discovery in databases (KDD), or data mining, can be defined as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. The steps in the KDD process are data selection, preprocessing, transformation, data mining, and interpretation/evaluation of the results, as shown in Figure 1.1.

Figure 1.1: Steps of the KDD process
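The five KDD steps can be sketched as a chain of small functions. This is only an illustration: the records, the field names (`customer`, `region`, `spend`), and the use of a mean as the "mined pattern" are assumptions made for the example and do not come from the text.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"customer": "a", "region": "north", "spend": 120.0},
    {"customer": "b", "region": "north", "spend": None},
    {"customer": "c", "region": "south", "spend": 80.0},
    {"customer": "d", "region": "north", "spend": 200.0},
]

def select(records, region):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if r["region"] == region]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if r["spend"] is not None]

def transform(records):
    """Transformation: reduce each record to the numeric feature to be mined."""
    return [r["spend"] for r in records]

def mine(values):
    """Data mining: a stand-in pattern, here simply the mean and the count."""
    return {"mean_spend": mean(values), "n": len(values)}

def interpret(pattern):
    """Interpretation/evaluation: present the pattern in readable form."""
    return f"{pattern['n']} customers, average spend {pattern['mean_spend']:.1f}"

def kdd(records, region):
    """Run the five steps in order."""
    return interpret(mine(transform(preprocess(select(records, region)))))

print(kdd(raw, "north"))
```

In a real project each stage would be far more involved, but the ordering of the stages is the point of the sketch.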

Data mining is a high-level technique used to analyze data and present the results to decision-makers. An enormous wealth of information is embedded in the huge databases belonging to enterprises, which has spurred interest in knowledge discovery and data mining. KDD refers to the overall process of discovering useful knowledge from data, while data mining refers to the application of algorithms for extracting patterns from data without the additional steps of the KDD process [30]. Because data mining is the central part of the KDD process, many researchers use the terms data mining and knowledge discovery in databases interchangeably.

1.1 Foundations of data mining:

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently produced technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in business because it is supported by three technologies that are now sufficiently mature: massive data collection, powerful multiprocessor computers, and data mining algorithms.

Commercial databases are growing at unprecedented rates. A META Group survey of data warehouse projects found that 19% of respondents were beyond the 50-gigabyte level, while 59% expected to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. In the evolution from business data to business information, each new step has built upon the previous one. From the user's point of view, the four steps listed in Table 1.1 are revolutionary because they allow new business questions to be answered accurately and quickly.

Table 1.1: Steps in the evolution of data mining

1.2 Data mining tasks:

The most common types of data mining tasks based on the kind of knowledge are listed below.

Characterization is the summarization or abstraction of a set of task-relevant data into a relation, called a generalized relation, which can then be used to extract characteristic rules. The characteristic rules present the characteristics of the data set, called the target class, and can be expressed at multiple conceptual levels and viewed from different angles.
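One way to picture a generalized relation is to replace each raw value with a higher-level concept and then merge identical generalized tuples, keeping a count. The sketch below assumes a hypothetical concept hierarchy (`CITY_TO_REGION`, `age_band`) and made-up customer records; none of these names come from the text.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> region.
CITY_TO_REGION = {"Boston": "East", "New York": "East", "Seattle": "West"}

def age_band(age):
    """Hypothetical generalization of a numeric age into a coarse band."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"

def characterize(records):
    """Generalize each tuple, then merge identical tuples with a count."""
    generalized = (
        (age_band(r["age"]), CITY_TO_REGION[r["city"]]) for r in records
    )
    return Counter(generalized)

customers = [
    {"age": 24, "city": "Boston"},
    {"age": 27, "city": "New York"},
    {"age": 45, "city": "Seattle"},
    {"age": 29, "city": "Boston"},
]

relation = characterize(customers)
for (band, region), count in relation.most_common():
    print(band, region, count)
```

Each row of the resulting relation can be read as a candidate characteristic rule, for example "most customers in the target class are young and live in the East".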

Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user and the corresponding data objects retrieved through database queries.

Association rule mining is the discovery of association rules showing attribute-value conditions that frequently occur together in a given set of data. An association rule has the form “A1 ^ A2 ^ … ^ Ai -> B1 ^ B2 ^ … ^ Bj”, where each Ai and Bj is an attribute-value pair. The association rule X -> Y is interpreted as: database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y.
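The phrases "occur frequently together" and "are also likely to satisfy" are commonly quantified by two standard measures, support and confidence. The text does not name them, so treat the sketch below as one conventional reading; the basket data and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Estimate of P(Y | X): support(X and Y) divided by support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(transactions, set(x) | set(y)) / support_x

# Hypothetical market-basket data: each transaction is a set of items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # how often X and Y co-occur
print(confidence(baskets, {"bread"}, {"milk"}))  # how often Y holds given X
```

A rule X -> Y is usually reported only when both values exceed user-chosen thresholds.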

Classification is the process of finding a set of models that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data.
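As a minimal sketch of deriving a model from training data and using it on an unlabeled object, the example below uses a nearest-centroid classifier. This is one simple technique chosen for illustration, not a method prescribed by the text; the training points and labels are hypothetical.

```python
from collections import defaultdict
from math import dist

def train(samples):
    """Build the model: one centroid (mean feature vector) per class label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }

def predict(model, features):
    """Assign the label whose centroid is closest to the new object."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical labeled training data: (features, class label).
training = [
    ((1.0, 1.0), "low"),
    ((1.0, 2.0), "low"),
    ((8.0, 9.0), "high"),
    ((9.0, 8.0), "high"),
]

model = train(training)
print(predict(model, (2.0, 2.0)))
print(predict(model, (7.0, 7.0)))
```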

Prediction is the estimation or forecast of the possible values of some missing data, or of the value distribution of certain attributes in a set of objects. This involves finding the set of attributes relevant to the attribute of interest and predicting the value distribution based on a set of data similar to the selected object.
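A simple way to estimate a missing numeric value from a relevant attribute is a least-squares line. The sketch below uses ordinary linear regression with a single predictor, which is an assumption made for illustration rather than the specific method described above; the data is made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_value(model, x):
    """Estimate the attribute of interest for a new value of the predictor."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical observations of a relevant attribute (xs) and the
# attribute of interest (ys).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

line = fit_line(xs, ys)
print(predict_value(line, 5.0))
```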

Clustering is the identification of classes for a set of unclassified objects based on their attributes. Objects are clustered so that intraclass similarities are maximized and interclass similarities are minimized according to some criterion. Once the clusters are decided, the objects in each cluster are summarized to form the class description.
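The idea of grouping similar objects and then summarizing each group can be sketched with k-means, one widely used clustering algorithm. The text does not name a specific algorithm, so this choice, the sample points, and the fixed starting centroids are all assumptions for the example.

```python
from math import dist

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # Keep the old centroid if a cluster happens to be empty.
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, assign(points, centroids)

# Hypothetical unlabeled objects with two numeric attributes.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

centroids, clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Here each final centroid plays the role of a very simple class description for its cluster.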

Evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, the distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
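A very small example of time-series trend analysis is to smooth a series with a moving average and compare the start and end of the smoothed curve. The window size, the sample series, and the three trend labels are assumptions chosen for illustration.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def trend(series, window=3):
    """Label the overall direction of the smoothed series."""
    smoothed = moving_average(series, window)
    if smoothed[-1] > smoothed[0]:
        return "rising"
    if smoothed[-1] < smoothed[0]:
        return "falling"
    return "flat"

# Hypothetical monthly sales figures.
sales = [10, 12, 11, 13, 15, 14, 16]

print(moving_average(sales, 3))
print(trend(sales))
```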

Outlier mining is the discovery and evaluation of the deviation patterns of objects in the target data in a time-related database. Most data mining methods discard outliers as noise or exceptions. In some applications, however, such as fraud detection, the rare events can be more interesting than the regularly occurring ones. Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures, where objects that are a substantial distance from any other cluster are considered outliers.
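Both detection styles mentioned above can be sketched in a few lines. The first function is a simple statistical test based on z-scores, which implicitly assumes roughly normal data; the second is a distance-based check that flags values with too few close neighbors. The thresholds, the sample data, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Statistical test: flag values far from the mean in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(values, radius, min_neighbors):
    """Distance measure: flag values with too few other values within `radius`."""
    flagged = []
    for v in values:
        # Subtract 1 so that a value does not count itself as a neighbor.
        neighbors = sum(abs(v - w) <= radius for w in values) - 1
        if neighbors < min_neighbors:
            flagged.append(v)
    return flagged

# Hypothetical transaction amounts with one unusually large value.
amounts = [10, 11, 9, 10, 12, 10, 50]

print(zscore_outliers(amounts))
print(distance_outliers(amounts, radius=3, min_neighbors=2))
```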

1.3 Architecture for Data Mining:

For efficient utilization, data mining techniques must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside the warehouse, requiring extra steps for extracting, importing and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining.

Figure 1.2: Integrated data mining architecture

Figure 1.2 illustrates an architecture for advanced analysis in a large data warehouse. The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact, coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems, such as Sybase, Oracle, or Redbrick, and should be optimized for flexible and fast data access.

1.4 Data Mining Applications:

A wide range of companies has deployed successful applications of data mining. Early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing. The technology is applicable to any company looking to leverage a large data warehouse to better manage its customer relationships. Two critical factors for success with data mining are a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, or campaign management).

Some successful application areas:

  • A pharmaceutical company can analyze its recent sales force activity and the results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the coming few months.
  • A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product.
  • A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services.
  • A large consumer packaged goods company can apply data mining to improve its sales process to retailers.
