Jul 1, 2012

Challenges of Big Data


Volume:
According to a study we produce around 2.5 quintillion bytes of data each day. We can no longer use the existing database management tools on such rapidly growing data. We will experience difficulties to capture, store, search, share, analyze and visualize data. Managing large volume and rapidly increasing database is a challenging issue and cannot be anymore mitigated with faster processors. The volume of data is exceeding the processing capacity. Moreover, the algorithms necessary to process the data are much more complex than usual ones making analysis on big data very complicated.
Velocity:
Some scenarios require analysis of data to be presented immediately. For example: fraudulent credit card transaction should be flagged before a transaction even occurs. The larger the data set to be processed the longer it will take to analyze. How we can capture the most important data as it happens? How quickly does the data move across the enterprise when you need to make a decision and deliver that to the right people in real-time?
Variety:
Big data when it comes in an unstructured format, such as text or video. Machines or algorithms expect homogenous data and so require data to be prepared before analysis. To apply appropriate filters that can reduce the size of raw data and at the same time ensures the availability of necessary data. Even after initial data cleaning some errors and incompleteness may exist in the data which must be handled during data analysis.

What is Big Data?


Big data is defined by its 3 characteristics which is also called 3 V’s:
1.        Volume: The increase in data volumes within enterprise systems is caused by transaction volumes and other traditional data types, as well as by new types of data.
2.       Velocity: This involves streams of data, structured record creation, and availability for access and delivery. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand.
3.       Variety: It includes tabular data (databases), hierarchical data, documents, e-mail, metering data, video, still images, audio, stock ticker data, financial transactions and more.

May 3, 2010

ROLAP, MOLAP and HOLAP


There are three different types of OLAP technology:

ROLAP stands for Relational-OLAP as it stores all the data in relational database. When we run the MDX queries it retrieves the data from the relational database using a corresponding SQL query. This makes the query processing to be slow but it does provide faster processing time.

MOLAP stands for Multidimensional-OLAP as it stores all the data in the multidimensional data structure of the Analysis Services. It has slow processing time but provides better query performance than ROLAP. MOLAP storage mode is most widely used as OLAP is meant to provide the faster query performance.

HOLAP stands for Hybrid-OLAP as it uses a combination of both the ROLAP and MOLAP. It stores only the indexes and aggregations in the multidimensional form while the rest of the data is stored in the relational database.

Mar 5, 2010

Time Dimension in Data Warehouse


In any data warehouse date and time plays a vital role in keeping track of when a particular event has taken place. However, we do not store the time information in the same way as we do in the OLTP database. In a dimensional model a separate time dimension is normally used. This time dimension contains separate attributes for year, month, and day and so on. This time dimension is initially populated with the data of a particular period like ten years. Whenever an event/transaction needs to be recorded in the fact table the time key from this dimension for the corresponding time is obtained from the time dimension.

Surrogate key for time dimension is auto incremented just like any other dimension but it is always helpful to have it the right format like for date: 05-Mar-2010 the time key should be 05032010. This proves to be extremely useful as queries can be done directly based on the time key and thus improves performance.
The following figures show how to create and populate time dimension in Sql Server Analysis Services (SSAS):





Nov 20, 2009

Data Warehouse Design

Data Warehouse can be designed in two different ways based on Ralph Kimball and Bill Inmon's model.
In Inmon's architecture data from OLTP databse is stored in a warehouse before it is transferred to the data marts. In Inmon's view the data warehouse is a real database.

According to Kimball's architecture data is transferred directly from OLTP database to the data mart. In Kimball's model the data warehouse is actually a collection of integrated data marts.

Nov 12, 2009

SQL Server Replication

Replication is the process of creating and maintaining a separate copy of database in another server or the same server and keeping both synchronized. Replication is useful in a number of scenario:

1. To integrate multiple database to build a central warehouse: Businesses operating on a number of distant locations usually collect data in separate database. However the senior management needs to generate reports showing analysis of data collected from all these sites. Hence the need for central warehouse is increasing and replication proves to be a powerful way to create and maintain the warehouse. It can also be used to create and maintain a separate data mart for OLAP operations.

2. To ensure higher availability and improved performance of web applications: Web applications with the increasing number of users faces huge problem in avoiding the extra load on the database. Replication along with other techniques can offer the right solution by distributing users to a number of replicated database in different servers and thus improving the performance and ensure high availability.

Sep 28, 2009

OLAP Video Tutorial

This is an exciting video I came across where OLAP concepts are explained in a easy to understand way using cartoons.

Sep 26, 2009

Dimensional Modeling


Dimensional Modeling refers to the technique used to design logical models for database which can support OLAP and Data Warehousing operations. It differs from the relational modeling or ER modeling in the sense that it can be used for data warehouse as well as relational databases.

Dimensional modeling is based on two concepts called facts and dimensions. Facts are usually numeric values or something that can be measured or aggregated. For example: monthly sales amount. Dimensions are groups of hierarchies and attributes that are used to define facts. For example product and its color. In a dimensional model a group of facts forms a fact table while a group of dimensions form a dimension table.


Sep 20, 2009

Market Basket Analysis


Market Basket Analysis is used to identify or gain interesting insights into the way customers buy items. It finds out the combination of items which are most likely to be bought together. For example buying milk and bread together can lead to buying butter. Identifying such patterns from the transaction data collected over the years provide new window of opportunity to explore and understand the purchasing behavior and thus can be used to increase sales through cross selling and targeted marketing. Market basket analysis is not limited to retail sales but are applied to a wide range of areas like analyzing credit card purchases, fraud detection for insurance claims and so on. Data Mining technique called association rule mining is usually used. Algorithms like FP-Growth and Apriori algorithm are usually used for mining the transaction data to get the desired patterns.

Sep 18, 2009

MDX - Multi-Dimensional Expressions

OLAP cubes stores data in multidimensional form whereas in the relational database data is stored in the 2 dimensions: rows and columns. OLAP cubes can store data in more than these 2 dimensions. The SQL or Structured Query Language enables us to access and manipulate the data resinding in the relational database. However SQL cannot be used for the same purposes for OLAP cubes. MDX is the language used for accessing and manipulating data in the cubes.

Jul 30, 2009

Web Usage Mining

Web Usage Mining: Whenever anyone logs on to a web site the user's access information like time spent on the site, links visited, geographical location from where accessed etc are stored in the log file of the website. These data can be mined to discover user access patterns for the site. The site can then be redesigned to maximize hits and better user experience like having the most accessed link to be available on the front page of the site. This also provides patterns which can be used to run successful ad campaigns on the site.

Jul 24, 2009

Web Mining

The WWW has revolutionize the way in which people interact, carry out their works and gather information. It has proved itself to be a useful interface for its users to carry out such activities with ease. With hundreds of millions of people around the world using it a huge pile of data are collected everyday. These data carries interesting insights on the way people interact with it. Web Mining is the process of using various data mining techniques to analyze and discover patterns from the data.

Jul 11, 2009

OLAP And OLTP

OLAP and OLTP are two of the most common words in the world of database. However the differences between the two are not well understood. OLTP and OLAP refer to Online Transaction Processing and Online Analytical Processing respectively. The word "online" is common to both and highlights that both the systems are capable to carry out certain things almost instantaneously.

The OLTP systems are used to register the transactions occurring in real time. In a retail shop the buying information needs to be instantaneously recorded by the salesman during any purchase of goods. OLTP systems allow such transactions to be carried out within a very short span of time. Hence to process any kinds of transactions OLTP systems are used.

It is not worth anymore to merely store the transaction data. We need to carry out various operations on those data to get the real picture of various aspects of the business. Reports generated based on these data are crucial part of the Decision Support System(DSS). However such complex operations on large volume of transaction data makes the report generation very slow and puts extra load on the OLTP system in use. So to overcome these limitations OLAP are used. OLAP enables data to be stored in multidimensional forms which is often called OLAP cube. OLAP cube ensures very fast retrieval or analysis of data.

Jan 23, 2009

Data Mining in Economic Crisis

In this time of economic turmoil the whole world is slowly plunging into recession. We have seen bailouts from governments worth billions of dollars for giant companies in US, Europe and Asia. The businesses are finding hard times to operate and so ability to make smarter and intelligent business decisions are imperative to overcome in such harsh conditions.

Businesses who have already invested in business intelligence solutions will be in a better position to undertake right measures to survive and continue its growth. However, many may argue that most of the giant companies were using Data mining and BI solutions but still could not avoid the collapse. The important thing to note here is that data mining solutions provides an analytical perspective into the performance of an organization based on historical data but the economic impact on an organization is linked to many issues and in many cases to external forces and unscrupulous activities. The failure to predict this does not undermine the role of data mining for organizations but on the contrary makes it more important especially for regulatory bodies of governments to predict and identify such practices in advance and take necessary measures to avoid such circumstances in future.

Dec 15, 2008

Applications of Data Mining

Data Mining has been able to grasp the attention of many in the field of scientific research, businesses, banking sector, intelligence agencies and many others from the early days of its inception. However its use was not as easy as it is now. The rapid growth of various tools and software during the past few years enable it to be used more widely than ever before. The ease with which one can carry out complex data mining techniques using these tools is simply outstanding.

Data Mining is used by businesses to do improve its marketing and to understand the buying patterns of its clients. Attrition Analysis, Customer Segmentation and Cross Selling are the most important ways through which data mining is showing the new ways in which businesses can multiply its revenue.

Data Mining is now used in the banking sector for credit card fraud detection by identifying the patterns involved in fraudulent transactions. It is also used to reduce credit risk by classifying a potential client and predicting bad loans.

Data Mining is used by intelligence agencies like FBI and CIA to identify threats of terrorism. After the 9/11 incident it has become one of the prime means to uncover terrorist plots. However this led to concerns among the people as data collected for such works undermines the privacy of a large number of people.