How association rules work.

The usefulness of this technique for data mining problems is best illustrated with a simple example. Suppose you are collecting data at the checkout registers of a large bookstore. Each customer transaction is logged in a database and consists of the titles of the books purchased by that customer, along with any magazines and gift items bought in the same transaction. Hence, each record in the database represents one customer transaction. It may consist of a single book, or of many (perhaps hundreds of) different items, arranged in an arbitrary order that depends on the order in which the items (books, magazines, etc.) came down the conveyor belt at the register.

The purpose of the analysis is to find associations between the items that were purchased, i.e., to derive association rules that identify the items and combinations of items that occur together most frequently. For example, you may want to learn which books are likely to be purchased by a customer who has already purchased (or is about to purchase) a particular book. This information can then be used to suggest those additional titles to the customer. You may already be familiar with the results of such analyses if you shop at online (Web-based) retailers: at checkout, the vendor will often suggest items similar to the ones you purchased, based on rules such as "customers who buy book title A are also likely to purchase book title B."

Unique data analysis requirements.

Crosstabulation tables, and in particular Multiple Response tables, can be used to analyze data of this kind.
However, when the number of different items (categories) in the data is very large (and not known ahead of time), and when the "factorial degree" of the important association rules (the number of items jointly involved in a rule) is not known ahead of time, these tabulation facilities may be too cumbersome to use, or simply not applicable. Consider once more the bookstore example discussed earlier. The number of book titles is practically unlimited. If we made a table in which each book title represented one dimension, and the purchase of that book (yes/no) formed the categories of that dimension, then the complete crosstabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and worse, if there were any three-way association rules "hiding" in the data, we would miss them completely.

The a-priori algorithm implemented in Association Rules will not only automatically detect the relationships ("cross-tabulation tables") that are important (i.e., cross-tabulation tables that are not sparse and do not contain mostly zeros), but will also determine the factorial degree of the tables that contain the important association rules.

To summarize, Association Rules will allow you to find rules of the kind "If X then (likely) Y", where X and Y can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables.
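To make the "If X then (likely) Y" form concrete, here is a minimal Python sketch that derives single-item rules from a handful of baskets. It is an illustration only, not the Association Rules implementation: the transactions, the function name pair_rules, and both thresholds are hypothetical, and "support" (share of baskets containing both items) and "confidence" (share of baskets with X that also contain Y) are the customary rule measures, assumed here rather than taken from the text above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set is one customer transaction.
transactions = [
    {"Book A", "Book B", "Magazine M"},
    {"Book A", "Book B"},
    {"Book A", "Book C"},
    {"Book B", "Magazine M"},
    {"Book A", "Book B", "Book C"},
]


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (X, Y, support, confidence) for single-item rules 'If X then Y'."""
    n = len(transactions)
    # How many baskets contain each item, and each unordered pair of items.
    item_counts = Counter(item for basket in transactions for item in basket)
    pair_counts = Counter(
        pair
        for basket in transactions
        for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of all baskets containing both items
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # share of X-baskets with Y
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


rules = pair_rules(transactions)
for x, y, support, confidence in rules:
    print(f"If {x} then (likely) {y}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

Note that this sketch only looks at pairs, so it shares the limitation described above for two-way tables: a three-way rule would go unnoticed.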
The algorithm will determine association rules without requiring the user to specify the number of distinct categories present in the data, or any prior knowledge regarding the maximum factorial degree or complexity of the important associations. In a sense, the algorithm will construct cross-tabulation tables without the need to specify the number of dimensions for the tables, or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.
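The following Python sketch shows one way a level-wise, a-priori style search can work without being told the number of categories or the maximum rule size. It is a simplified illustration under stated assumptions, not the program's actual code: the function name frequent_itemsets, the min_count threshold, and the sample baskets are all hypothetical. The distinct items are discovered from the data, and the itemset size grows by one per pass until no larger frequent itemset exists.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Level-wise search for itemsets contained in at least min_count baskets."""
    baskets = [frozenset(t) for t in transactions]

    def count(candidates):
        # Number of baskets that contain every item of each candidate.
        return {c: sum(1 for b in baskets if c <= b) for c in candidates}

    # Level 1: the distinct items are read off the data, not specified up front.
    items = {frozenset([i]) for b in baskets for i in b}
    current = {c: n for c, n in count(items).items() if n >= min_count}
    frequent = {}
    k = 1
    while current:  # stops by itself once no itemset of size k is frequent
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: n for c, n in count(candidates).items() if n >= min_count}
    return frequent


# Hypothetical baskets; item D is too rare to take part in any frequent itemset.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "C", "D"},
]
frequent = frequent_itemsets(transactions, min_count=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The pruning step is what keeps the search away from sparse tables: a combination is only counted if all of its smaller sub-combinations were already frequent.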
| April 2003 |
| April 2, 2003 | Introduction to Visual Basic and STATISTICA Visual Basic | Tulsa, OK |
| April 3, 2003 | Visual Basic Applications in STATISTICA | Tulsa, OK |
| April 21-22, 2003 | Introduction | Tulsa, OK |
| April 23, 2003 | DOE | Tulsa, OK |
| April 24, 2003 | SPC | Tulsa, OK |
| April 25, 2003 | Graphical Data Analysis | Tulsa, OK |
| May 2003 |
| May 6-7, 2003 | Introduction | Ft. Lauderdale, FL |
| May 8, 2003 | ANOVA/Regression | Ft. Lauderdale, FL |
| May 9, 2003 | Introduction to Visual Basic and STATISTICA Visual Basic | Ft. Lauderdale, FL |
| May 19-20, 2003 | Introduction | Tulsa, OK |
| May 21, 2003 | ANOVA/Regression | Tulsa, OK |
| May 22, 2003 | Multivariate Analysis | Tulsa, OK |
| May 23, 2003 | Introduction to Visual Basic and STATISTICA Visual Basic | Tulsa, OK |
| June 2003 |
| June 3-4, 2003 | Introduction | Philadelphia, PA |
| June 5, 2003 | DOE | Philadelphia, PA |
| June 6, 2003 | SPC | Philadelphia, PA |
| June 16-17, 2003 | Introduction | Tulsa, OK |
| June 18, 2003 | Introduction to Visual Basic and STATISTICA Visual Basic | Tulsa, OK |
| June 19, 2003 | ANOVA/Regression | Tulsa, OK |
| June 20, 2003 | Multivariate Analysis | Tulsa, OK |
| June 23-24, 2003 | Introduction | Dallas, TX |
| June 25, 2003 | SPC | Dallas, TX |
| June 26, 2003 | DOE | Dallas, TX |
©Copyright StatSoft, Inc., 1984-2006.
StatSoft, StatSoft logo, STATISTICA, Enterprise/QC, Enterprise, Data Miner, SEPATH and GTrees are trademarks of StatSoft, Inc.