Wednesday, November 25, 2009

Data mining

The extraction of hidden predictive information from large databases.
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both.
Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries.

Four types of relationships in data mining:



  • Classes: Stored data is used to locate data in predetermined groups.

  • Clusters: Data items are grouped according to logical relationships or consumer preferences.

  • Associations: Data can be mined to identify associations between items or events (for example, products that are frequently purchased together).

  • Sequential patterns: Data is mined to anticipate behavior patterns and trends.
Techniques




  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).

  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique; a minimal sketch in Python follows this list.

  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
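
As a rough illustration of the nearest neighbor method, here is a minimal Python sketch. The customer-segment data, the two features, and the choice of Euclidean distance are assumptions made up for the example, not part of any particular tool:

    from collections import Counter

    def knn_classify(record, history, k=3):
        """Classify `record` by a majority vote among the k most similar
        records in `history`, where each entry is (features, label)."""
        def distance(a, b):
            # Plain Euclidean distance on numeric features.
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

        neighbors = sorted(history, key=lambda item: distance(record, item[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Hypothetical history: (monthly spend, visits) -> customer segment.
    history = [((120.0, 4), "regular"), ((15.0, 1), "occasional"),
               ((300.0, 9), "frequent"), ((25.0, 2), "occasional"),
               ((180.0, 6), "regular")]
    print(knn_classify((160.0, 5), history, k=3))  # -> "regular"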

http://www.thearling.com/index.htm#wps - Information about data mining and analytic technologies
http://www.statsoft.com/textbook/stdatmin.html - white papers about data mining, etc.
http://www.thearling.com/dmintro/dmintro_frame.htm - data mining introduction presentation

Data schemas
Star schema



The star schema (sometimes referred to as the star join schema) is the simplest style of data warehouse schema. It consists of a few fact tables (possibly only one, giving the diagram its star shape) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.



Model
The facts that the data warehouse helps analyze are classified along different dimensions: the fact tables hold the main data, while the usually smaller dimension tables describe each value of a dimension and can be joined to fact tables as needed.
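
As a rough illustration of this model, here is a minimal Python/sqlite3 sketch with one fact table and two dimension tables; all table and column names are invented for the example:

    import sqlite3

    # A toy star schema: one central fact table referencing two dimension tables.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INT, month INT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE fact_sales  (date_id INT, product_id INT, units INT, amount REAL);

        INSERT INTO dim_date    VALUES (1, 2009, 11);
        INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware');
        INSERT INTO fact_sales  VALUES (1, 10, 5, 49.95);
    """)

    # A typical query: facts are filtered and described through the dimensions.
    for row in con.execute("""
        SELECT d.year, d.month, p.name, SUM(f.units), SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d    ON f.date_id = d.date_id
        JOIN dim_product p ON f.product_id = p.product_id
        GROUP BY d.year, d.month, p.name
    """):
        print(row)  # (2009, 11, 'Widget', 5, 49.95)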
Snowflake schema



In the snowflake schema, however, dimensions are normalized into multiple related tables whereas the star schema's dimensions are denormalized with each dimension being represented by a single table.
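
Continuing the illustrative sketch from the star schema section, the same product dimension could be snowflaked like this (the names are again invented):

    import sqlite3

    # The product dimension, snowflaked: the category is normalized out into
    # its own table instead of being repeated on every product row.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT,
                                   category_id INT REFERENCES dim_category(category_id));
    """)
    # Queries now need one extra join (product -> category) compared with the
    # single denormalized dim_product table used in the star schema.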



The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulation.

Tuesday, November 24, 2009

Data Warehouse

A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.

Data warehouse architecture
  • Operational database layer

The source data for the data warehouse - An organization's Enterprise Resource Planning systems fall into this layer.

  • Data access layer

The interface between the operational and informational access layer - Tools to extract, transform, load data into the warehouse fall into this layer.

  • Metadata layer

The data directory - This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

  • Informational access layer

The data accessed for reporting and analyzing and the tools for reporting and analyzing data - Business intelligence tools fall into this layer.

Benefits of data warehousing

  • A common data model, which makes it easier to report and analyze information than when multiple data models are used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
  • Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
  • The information in the warehouse can be stored safely for extended periods of time.
  • Data warehouses provide retrieval of data without slowing down operational systems.

Disadvantages of data warehouses

  • Not the optimal environment for unstructured data.
  • There is an element of latency in data warehouse data.
  • Maintenance costs are high.
  • Data warehouses can get outdated relatively quickly.
  • Functionality that already exists in the operational systems may have to be duplicated in the data warehouse.

SSIS & SSRS



SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server database software which can be used to perform a broad range of data migration tasks.
SSIS is a platform for data integration and workflow applications. It features a fast and flexible data warehousing tool used for data extraction, transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server databases and updates to multidimensional cube data.


Integration Services includes

  • graphical tools and wizards for building and debugging packages;
  • tasks for performing workflow functions such as FTP operations, SQL statement execution, and e-mail messaging;
  • data sources and destinations for extracting and loading data;
  • transformations for cleaning, aggregating, merging, and copying data;
  • a management service, the Integration Services service, for administering Integration Services packages;
  • and application programming interfaces (APIs) for programming the Integration Services object model.

Typical Uses of Integration Services

  • Merging Data from Heterogeneous Data Stores
  • Populating Data Warehouses and Data Marts
  • Cleaning and Standardizing Data
  • Building Business Intelligence into a Data Transformation Process
  • Automating Administrative Functions and Data Loading

SSRS - SQL Server Reporting Services

SSRS provides a full range of ready-to-use tools and services to help you create, deploy, and manage reports for your organization, as well as programming features that enable you to extend and customize your reporting functionality.

ETL

Extract, transform, load
The functions performed when pulling data out of one database and placing it into another of a different type. ETL is used to migrate data, often from relational databases into decision support systems.
ETL, EII and EAI
ETL differs from EII (enterprise information integration), which aggregates data temporarily from disparate data sources, and EAI (enterprise application integration), which integrates data from one application to another. See EII and EAI.

Real-life ETL cycle


The typical real-life ETL cycle consists of the following execution steps:

  1. Cycle initiation
  2. Build reference data
  3. Extract (from sources)
  4. Validate
  5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
  6. Stage (load into staging tables, if used)
  7. Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
  8. Publish (to target tables)
  9. Archive
  10. Clean up
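
A minimal Python sketch of a few of these steps, assuming a CSV source with customer_id and amount columns and a SQLite database standing in for the staging and target areas (real ETL tools such as SSIS implement the full cycle; this is only a toy walk-through):

    import csv
    import sqlite3

    def run_etl_cycle(source_csv, con):
        """Toy ETL pass: extract from a CSV source, validate, transform,
        stage, publish to a target table, and clean up the staging area."""
        # Extract (from sources)
        with open(source_csv, newline="") as f:
            rows = list(csv.DictReader(f))

        # Validate: drop rows with no customer id
        rows = [r for r in rows if r.get("customer_id")]

        # Transform: apply a simple business rule (normalize the amount)
        cleaned = [(r["customer_id"], round(float(r["amount"]), 2)) for r in rows]

        # Stage: load into a staging table first
        con.execute("CREATE TABLE IF NOT EXISTS stage_sales (customer_id TEXT, amount REAL)")
        con.executemany("INSERT INTO stage_sales VALUES (?, ?)", cleaned)

        # Publish: move the staged data into the target table
        con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
        con.execute("INSERT INTO sales SELECT * FROM stage_sales")

        # Clean up the staging area
        con.execute("DELETE FROM stage_sales")
        con.commit()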

Challenges

  • The range of data values or data quality
  • Data warehouses typically grow asynchronously
  • Increasing volumes of data
  • Real-time change-data capture

Performance

In real life, the slowest part of an ETL process usually occurs in the database load phase.

  • Do most of the ETL processing outside of the database
  • Partition tables (and indices). Try to keep partitions similar in size (watch for null values which can skew the partitioning).
  • Do all validation in the ETL layer before the load. Disable integrity checking in the target database tables during the load.
  • Disable triggers in the target database tables during the load. Simulate their effect as a separate step.
  • Generate IDs in the ETL layer (not in the database).
  • Drop the indexes (on a table or partition) before the load, and recreate them after the load (a small sketch follows this list).
  • Use parallel bulk load when possible.
  • You can often use bulk load for inserts, but not for updates and deletes.
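
A small sketch of the drop-index / bulk-insert / rebuild-index pattern from the tips above, with sqlite3 standing in for the real target database and an assumed sales table and ix_sales_customer index:

    import sqlite3

    def bulk_load(con, rows):
        """Drop the index, insert the batch in one set-based pass,
        then rebuild the index (all names are illustrative)."""
        con.execute("DROP INDEX IF EXISTS ix_sales_customer")
        con.executemany("INSERT INTO sales (customer_id, amount) VALUES (?, ?)", rows)
        con.execute("CREATE INDEX ix_sales_customer ON sales (customer_id)")
        con.commit()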

Parallel processing

  • Data: By splitting a single sequential file into smaller data files to provide parallel access (a small sketch follows this list).
  • Pipeline: Allowing the simultaneous running of several components on the same data stream. For example: looking up a value on record 1 at the same time as adding two fields on record 2.
  • Component: The simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.
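
A small Python sketch of the data-parallel case from the list above: one sequential file is split into chunks and several workers process them at the same time (the per-chunk transformation is only a placeholder):

    from multiprocessing import Pool

    def transform_chunk(lines):
        # Stand-in for the real per-record transformation.
        return [line.upper() for line in lines]

    def parallel_transform(path, workers=4):
        """Split one sequential file into roughly equal chunks and let a
        pool of workers transform them simultaneously."""
        with open(path) as f:
            lines = f.readlines()
        size = max(1, len(lines) // workers)
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        with Pool(workers) as pool:
            results = pool.map(transform_chunk, chunks)
        return [line for chunk in results for line in chunk]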

Monday, November 23, 2009

VLDB - Topics of Interest

Core Database Technology Track
  • Concurrency Control,
  • Data Models and Languages,
  • Embedded and Mobile Databases,
  • Fuzzy, Probabilistic, and Approximate Data,
  • Native Semi-Structured Data and XML,
  • Parallel, Distributed, and Grid Databases,
  • Spatial, Temporal & Multimedia Databases,
  • Stream Databases.

Infrastructure for Information Systems Track

  • Data Design, Evolution and Migration,
  • Heterogeneous and Federated DBMS (Interoperability),
  • Information Filtering and Dissemination,
  • Mobile Data Management,
  • Novel/Advanced Applications,
  • On-Line Analytic Processing,
  • P2P and Networked Data Management,
  • Profile-based Data Management,
  • Sensor Networks,
  • Social Systems and Recommendations

Industrial Applications and Experience Track

  • Electronic Commerce,
  • Geographic Information Systems,
  • Industrial-Strength Systems based on DB Technology,
  • Mobile Computing,
  • Self-Managing Systems,
  • System Design and Implementation using DB Technology.

VLDB - Very Large Data Base

Today I was interested in the abbreviation VLDB.
(Very Large DataBase) An extremely large database. What constitutes a VLDB is debatable, but databases of 100GB or more are generally considered the starting point. A VLDB is also measured by the number of transactions per second and the type of complex queries that it can support.

Tuple - (1) In a relational database, a tuple is one record (one row).

(2) A set of values passed from one programming language to another application program or to a system program such as the operating system. Typically separated by commas, the values may be parameters for a function call or a set of data values for a database.
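
A small Python example that ties the two senses together: a row fetched from a relational database comes back to the calling program as a tuple of values (the table and data are made up):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
    con.execute("INSERT INTO employee VALUES (1, 'Ada')")

    # One record from the table, sense (1), handed back as a comma-separated
    # set of values, sense (2).
    row = con.execute("SELECT id, name FROM employee").fetchone()
    print(row)  # (1, 'Ada')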


In database research, VLDB commonly refers to the non-profit VLDB Endowment and its annual research conference, International Conference on Very Large Data Bases.

VLDB is a premier annual international forum for data management and database researchers, vendors, practitioners, application developers, and users.

The 2009 conference was held in Lyon.

The 2010 conference will be in Singapore (http://www.vldb2010.org/).


The conference seems to be quite interesting. And I'd like to go to Singapore.