Nov 20, 2009

Data Warehouse Design

Data Warehouse can be designed in two different ways based on Ralph Kimball and Bill Inmon's model.
In Inmon's architecture data from OLTP databse is stored in a warehouse before it is transferred to the data marts. In Inmon's view the data warehouse is a real database.

According to Kimball's architecture data is transferred directly from OLTP database to the data mart. In Kimball's model the data warehouse is actually a collection of integrated data marts.

Nov 12, 2009

SQL Server Replication

Replication is the process of creating and maintaining a separate copy of database in another server or the same server and keeping both synchronized. Replication is useful in a number of scenario:

1. To integrate multiple database to build a central warehouse: Businesses operating on a number of distant locations usually collect data in separate database. However the senior management needs to generate reports showing analysis of data collected from all these sites. Hence the need for central warehouse is increasing and replication proves to be a powerful way to create and maintain the warehouse. It can also be used to create and maintain a separate data mart for OLAP operations.

2. To ensure higher availability and improved performance of web applications: Web applications with the increasing number of users faces huge problem in avoiding the extra load on the database. Replication along with other techniques can offer the right solution by distributing users to a number of replicated database in different servers and thus improving the performance and ensure high availability.

Sep 28, 2009

OLAP Video Tutorial

This is an exciting video I came across where OLAP concepts are explained in a easy to understand way using cartoons.

Sep 26, 2009

Dimensional Modeling


Dimensional Modeling refers to the technique used to design logical models for database which can support OLAP and Data Warehousing operations. It differs from the relational modeling or ER modeling in the sense that it can be used for data warehouse as well as relational databases.

Dimensional modeling is based on two concepts called facts and dimensions. Facts are usually numeric values or something that can be measured or aggregated. For example: monthly sales amount. Dimensions are groups of hierarchies and attributes that are used to define facts. For example product and its color. In a dimensional model a group of facts forms a fact table while a group of dimensions form a dimension table.


Sep 20, 2009

Market Basket Analysis


Market Basket Analysis is used to identify or gain interesting insights into the way customers buy items. It finds out the combination of items which are most likely to be bought together. For example buying milk and bread together can lead to buying butter. Identifying such patterns from the transaction data collected over the years provide new window of opportunity to explore and understand the purchasing behavior and thus can be used to increase sales through cross selling and targeted marketing. Market basket analysis is not limited to retail sales but are applied to a wide range of areas like analyzing credit card purchases, fraud detection for insurance claims and so on. Data Mining technique called association rule mining is usually used. Algorithms like FP-Growth and Apriori algorithm are usually used for mining the transaction data to get the desired patterns.

Sep 18, 2009

MDX - Multi-Dimensional Expressions

OLAP cubes stores data in multidimensional form whereas in the relational database data is stored in the 2 dimensions: rows and columns. OLAP cubes can store data in more than these 2 dimensions. The SQL or Structured Query Language enables us to access and manipulate the data resinding in the relational database. However SQL cannot be used for the same purposes for OLAP cubes. MDX is the language used for accessing and manipulating data in the cubes.

Jul 30, 2009

Web Usage Mining

Web Usage Mining: Whenever anyone logs on to a web site the user's access information like time spent on the site, links visited, geographical location from where accessed etc are stored in the log file of the website. These data can be mined to discover user access patterns for the site. The site can then be redesigned to maximize hits and better user experience like having the most accessed link to be available on the front page of the site. This also provides patterns which can be used to run successful ad campaigns on the site.

Jul 24, 2009

Web Mining

The WWW has revolutionize the way in which people interact, carry out their works and gather information. It has proved itself to be a useful interface for its users to carry out such activities with ease. With hundreds of millions of people around the world using it a huge pile of data are collected everyday. These data carries interesting insights on the way people interact with it. Web Mining is the process of using various data mining techniques to analyze and discover patterns from the data.

Jul 11, 2009

OLAP And OLTP

OLAP and OLTP are two of the most common words in the world of database. However the differences between the two are not well understood. OLTP and OLAP refer to Online Transaction Processing and Online Analytical Processing respectively. The word "online" is common to both and highlights that both the systems are capable to carry out certain things almost instantaneously.

The OLTP systems are used to register the transactions occurring in real time. In a retail shop the buying information needs to be instantaneously recorded by the salesman during any purchase of goods. OLTP systems allow such transactions to be carried out within a very short span of time. Hence to process any kinds of transactions OLTP systems are used.

It is not worth anymore to merely store the transaction data. We need to carry out various operations on those data to get the real picture of various aspects of the business. Reports generated based on these data are crucial part of the Decision Support System(DSS). However such complex operations on large volume of transaction data makes the report generation very slow and puts extra load on the OLTP system in use. So to overcome these limitations OLAP are used. OLAP enables data to be stored in multidimensional forms which is often called OLAP cube. OLAP cube ensures very fast retrieval or analysis of data.

Jan 23, 2009

Data Mining in Economic Crisis

In this time of economic turmoil the whole world is slowly plunging into recession. We have seen bailouts from governments worth billions of dollars for giant companies in US, Europe and Asia. The businesses are finding hard times to operate and so ability to make smarter and intelligent business decisions are imperative to overcome in such harsh conditions.

Businesses who have already invested in business intelligence solutions will be in a better position to undertake right measures to survive and continue its growth. However, many may argue that most of the giant companies were using Data mining and BI solutions but still could not avoid the collapse. The important thing to note here is that data mining solutions provides an analytical perspective into the performance of an organization based on historical data but the economic impact on an organization is linked to many issues and in many cases to external forces and unscrupulous activities. The failure to predict this does not undermine the role of data mining for organizations but on the contrary makes it more important especially for regulatory bodies of governments to predict and identify such practices in advance and take necessary measures to avoid such circumstances in future.

Dec 15, 2008

Applications of Data Mining

Data Mining has been able to grasp the attention of many in the field of scientific research, businesses, banking sector, intelligence agencies and many others from the early days of its inception. However its use was not as easy as it is now. The rapid growth of various tools and software during the past few years enable it to be used more widely than ever before. The ease with which one can carry out complex data mining techniques using these tools is simply outstanding.

Data Mining is used by businesses to do improve its marketing and to understand the buying patterns of its clients. Attrition Analysis, Customer Segmentation and Cross Selling are the most important ways through which data mining is showing the new ways in which businesses can multiply its revenue.

Data Mining is now used in the banking sector for credit card fraud detection by identifying the patterns involved in fraudulent transactions. It is also used to reduce credit risk by classifying a potential client and predicting bad loans.

Data Mining is used by intelligence agencies like FBI and CIA to identify threats of terrorism. After the 9/11 incident it has become one of the prime means to uncover terrorist plots. However this led to concerns among the people as data collected for such works undermines the privacy of a large number of people.

Oct 31, 2008

Steps of Data Mining


There are various steps that are involved in mining data as shown in the picture.
  1. Data Integration: First of all the data are collected and integrated from all the different sources.
  2. Data Selection: We may not all the data we have collected in the first step. So in this step we select only those data which we think useful for data mining.
  3. Data Cleaning: The data we have collected are not clean and may contain errors, missing values, noisy or inconsistent data. So we need to apply different techniques to get rid of such anomalies.
  4. Data Transformation: The data even after cleaning are not ready for mining as we need to transform them into forms appropriate for mining. The techniques used to accomplish this are smoothing, aggregation, normalization etc.
  5. Data Mining: Now we are ready to apply data mining techniques on the data to discover the interesting patterns. Techniques like clustering and association analysis are among the many different techniques used for data mining.
  6. Pattern Evaluation and Knowledge Presentation: This step involves visualization, transformation, removing redundant patterns etc from the patterns we generated.
  7. Decisions / Use of Discovered Knowledge: This step helps user to make use of the knowledge acquired to take better decisions.

Oct 25, 2008

Difference between Data Mining and Data Warehousing

Data Mining provides the Enterprise with intelligence and Data Warehousing provides the Enterprise with a memory.

Data warehousing is the process that is used to integrate and combine data from multiple sources and format into a single unified schema. So it provides the enterprise with a storage mechanism for its huge amount of data. On the other hand, Data mining is the process of extracting interesting patterns and knowledge from huge amount of data. So we can apply data mining techniques on the data warehouse of an enterprise to discover useful patterns.

Oct 24, 2008

What is Data Warehousing?

An organization may have a number of different systems like TPS, ERP, Accounting Systems and so on. Each of these systems collect data but the data collected through these systems are not always in the same format. So we need to combine all these data from different sources into a single standard form and this process of combining multiple databases into a single homogenous form is called data warehousing.

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.Subject-oriented

The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
Time-variant
The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time.
Non-volatile
Data in the data warehouse is never over-written or deleted - once committed, the data is static, read-only, and retained for future reporting.
Integrated
The data warehouse contains data from most or all of an organization's operational systems and this data is made consistent.

History of Data Mining


Data mining is a fairly new concept which was emerged in the late 1980s. But it soon attracted huge interests for research works and flourishes with many new and remarkable techniques being discovered throughout the 1990s. Data Mining has evolved from a number of different disciplines like statistics, machine learning, artificial intelligence, database technologies and so on.