Introduction to data mining processes
What is data mining? The term data mining (DM) refers to a class of applications that process large amounts of data in search of hidden patterns and regularities that can be used to predict future behaviour. The term is relatively broad: it covers a large set of methods derived from mathematical statistics, as well as other procedures that people used before computers were available to assist them.
Some typical examples are:
- A decision tree built from the membership history, used to decide whether a prospective member should be granted a credit card or a loan;
- Finding regularities in the behaviour of tourists in order to offer them different discount models and thereby attract new customers;
- “Diapers and beer”: analysing retail transactions to find out that consumers often buy diapers and beer together. Why this is so, and what causes it, is not the seller's concern; the aim is to identify recognisable types of customers and offer them something “more”;
- In human genome research, DM methods have helped to discover the causes of many hereditary diseases (e.g. the genes responsible for the development of diabetes).
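The “diapers and beer” example can be made concrete with two simple measures commonly used for such basket analysis: support (how often the items occur together) and confidence (how often the second item appears given the first). The following is only a minimal sketch; the baskets and the function names `count`, `support` and `confidence` are invented for illustration and do not come from real retail data.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "bread", "beer"},
    {"beer", "crisps"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets)


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    n_antecedent = count(antecedent, baskets)
    if n_antecedent == 0:
        return 0.0  # the rule never applies, so there is no evidence for it
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / n_antecedent


if __name__ == "__main__":
    print("support(diapers, beer):", support({"diapers", "beer"}, baskets))
    print("confidence(diapers -> beer):", confidence({"diapers"}, {"beer"}, baskets))
    print("confidence(beer -> diapers):", confidence({"beer"}, {"diapers"}, baskets))
```

A high confidence in one direction and a lower one in the other shows that such a rule is not symmetric, which matters when deciding which product to promote next to which.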
Data mining draws on several scientific disciplines, and its results come from their multidisciplinary synergy. The most important of these are:
- Artificial intelligence, especially the field known as “machine learning”
- Research on clustering algorithms
- Visualization techniques
Statistics is at the heart of most data mining methods, and some argue that the other data mining methods are also part of standard statistical analysis. Machine learning is used to enable software to learn some of the models by itself, notably in the case of neural networks. Clustering algorithms are described later in Chapter 6.5.3, DM clustering. Visualization techniques are important for preparing data and for reaching a conclusion more or less easily, without much help from the mathematical apparatus. In practice, the data sources are almost always located in a database, and working with them belongs to the data preparation that precedes the analysis. This is a less interesting but vital part of data mining.
The data mining process can be divided into several important steps:
|1. Data collection||The longest step in terms of time. If the data come from a production system, preparing them may take a long time, even a few months. Extracting the details often requires very good knowledge of ERP systems, and since these systems are the subject of the work, a larger data set has to be prepared in order to get more relevant results. Very often transaction data, such as payments at POS terminals, must be prepared as well. These are such large quantities of data that merely handling them is a considerable problem.|
|2. Data cleansing||Removing garbage data is also a lengthy process. It has to follow a set of rules defined on the attributes used in the analysis, typically the valid values of each attribute, such as sex (‘M’, ‘F’) or age in years (18..100). All rows containing an obviously false attribute value, which usually appears by mistake, are removed.|
|2.1. Creating a test set||In machine learning the data set is divided into two groups: a set for learning and a set for testing the hypothesis. The DM algorithm (e.g. a neural network) learns from the first group. Its results are then tested by comparing them with the second set, for which the correct results are known. The goal is to judge how well the DM algorithm has learned and how well it predicts results.|
|3. Pattern recognition||DM in the narrower sense – the execution of the algorithm|
|4. Evaluation and visualization of results||Not every discovered fact is true. Very often, because no test data set is available, the results of mining are not relevant. Whether the results are relevant should be decided by an expert in the area being analysed.|
Table 1. The most important functions of data mining
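Steps 2 and 2.1 of the table can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the rows are invented, the rules use the sex and age ranges mentioned in the table, and the names `is_valid`, `cleanse` and `split`, as well as the 25 % test fraction, are hypothetical choices rather than a prescribed method.

```python
import random

# Hypothetical rows, invented for illustration only.
rows = [
    {"sex": "M", "age": 34},
    {"sex": "F", "age": 27},
    {"sex": "X", "age": 45},   # invalid sex code
    {"sex": "F", "age": 230},  # impossible age
    {"sex": "M", "age": 61},
    {"sex": "F", "age": 18},
    {"sex": "M", "age": 17},   # below the lower bound
    {"sex": "F", "age": 52},
]

# Cleansing rules taken from the table: sex in {'M', 'F'}, age in 18..100.
VALID_SEX = {"M", "F"}
MIN_AGE, MAX_AGE = 18, 100


def is_valid(row):
    """True when every attribute used in the analysis has a plausible value."""
    age = row.get("age")
    return (
        row.get("sex") in VALID_SEX
        and isinstance(age, int)
        and MIN_AGE <= age <= MAX_AGE
    )


def cleanse(rows):
    """Drop all rows that break at least one rule."""
    return [row for row in rows if is_valid(row)]


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (learning set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


clean = cleanse(rows)
learning_set, test_set = split(clean)

if __name__ == "__main__":
    print(len(rows), "rows read,", len(clean), "rows kept")
    print(len(learning_set), "rows for learning,", len(test_set), "for testing")
```

Shuffling before the split is a design choice that avoids a test set made up only of the most recent or last-loaded rows; the test set must stay untouched while the algorithm learns.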
Data mining software can help companies in different industries predict the behaviour of their customers. Take a credit card company as an example. DM software is often used for so-called “fraud detection”, i.e. to recognise fraud on cards (preferably before it occurs). How does it work? It takes into account the historical behaviour of a member, whose habits are visible from his last n transactions.
In the example described below, the buyer's past transactions show a habit of buying goods worth 50 to 400 USD. Most of the goods were purchased in stores of the “retail chain” type, so the customer buys food and similar goods. Transactions with a low amount, or at other types of point of sale, are rare. Suddenly a transaction appears whose type of trade is marked as luxury goods and whose amount is very high.
It can be concluded that the difference from the average amount is large, and that the distance from the region of the graph where most purchases lie is also very large. The transaction can therefore be considered suspicious, to say the least, although it may be entirely legitimate if, for example, the buyer intends to get engaged and is buying an engagement ring. Or is it really the result of theft, with the thief trying to quickly and easily buy valuable goods such as gold or jewellery?
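The reasoning above can be sketched as a simple check: how far the new amount lies from the member's average, measured in standard deviations, and how unusual the type of trade is for this member. This is only a minimal sketch and not how any particular fraud detection product works; the transaction history, the function names and both thresholds (`z_threshold`, `min_share`) are assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical history of one member: (amount in USD, type of trade).
history = [
    (120.0, "retail"), (85.5, "retail"), (310.0, "retail"), (60.0, "retail"),
    (240.0, "retail"), (399.0, "retail"), (150.0, "fuel"), (95.0, "retail"),
]


def amount_z_score(amount, history):
    """Distance of amount from the historical mean, in standard deviations."""
    amounts = [a for a, _ in history]
    spread = stdev(amounts)  # needs at least two past transactions
    if spread == 0:
        return 0.0 if amount == amounts[0] else float("inf")
    return (amount - mean(amounts)) / spread


def category_share(category, history):
    """Fraction of past transactions made in the given type of trade."""
    return sum(c == category for _, c in history) / len(history)


def is_suspicious(amount, category, history, z_threshold=3.0, min_share=0.05):
    """Flag a transaction that is unusual in both amount and type of trade.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    unusual_amount = abs(amount_z_score(amount, history)) > z_threshold
    unusual_category = category_share(category, history) < min_share
    return unusual_amount and unusual_category


if __name__ == "__main__":
    print(is_suspicious(130.0, "retail", history))   # an everyday purchase
    print(is_suspicious(4800.0, "luxury", history))  # the sudden luxury purchase
```

A flag raised this way only marks the transaction for review. As the engagement ring example shows, an unusual purchase can be entirely legitimate, so the final judgement still needs further evidence or a human decision.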