Text Mining

Text mining, sometimes referred to as “text analytics”, is one way to make qualitative or “unstructured” data usable by a computer.
Unstructured text is everywhere:
·         By a commonly cited estimate, about 80% of data is unstructured.
·         This includes emails, newspaper or web articles, internal reports, transcripts of phone calls, research papers, blog entries, and patent applications, to name a few.

The Oxford English Dictionary defines text mining as:
 “The process or practice of examining large collections of written resources in order to generate new information, typically using specialized computer software. It is a subset of the larger field of data mining”
Text mining refers to the process of parsing a selection or corpus of text in order to identify certain aspects, such as the most frequently occurring word or phrase.
The purpose of text mining is to process unstructured (textual) information and extract meaningful numeric indices from it, making the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms.
We can analyze words, clusters of words used in documents, etc., or we could analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project. In the most general terms, text mining will "turn text into numbers" (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects.
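
As a minimal sketch of "turning text into numbers", the snippet below counts how often each word occurs in a short piece of text and lists the most frequent ones. The sample sentence and the function name `word_counts` are illustrative assumptions, not part of any particular text mining tool.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical sample text, used only for illustration.
corpus = "Text mining turns text into numbers. Numbers feed data mining."
counts = word_counts(corpus)

# The three most frequent words with their counts.
print(counts.most_common(3))
```

The resulting counts are exactly the kind of numeric index that other analyses can then take as input.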




How does it work?
Text mining applies techniques from areas such as information retrieval, natural language processing, information extraction and data mining. The stages of a text-mining process can be combined into a single workflow.
  • Information retrieval (IR) systems match a user’s query to documents in a collection or database. The first step in the text mining process is to find the body of documents that are relevant to the research question(s).
  • Natural language processing (NLP) analyzes text according to the structures of human language. It allows the computer to perform a grammatical analysis of a sentence in order to “read” the text. For example:
      · Part-of-speech tagging classifies words into categories such as noun, verb or adjective.
      · Word sense disambiguation identifies the meaning of a word, given its usage, from among the multiple meanings that the word may have.
    The role of NLP in text mining is to provide the systems in the information extraction phase with the linguistic data they need to perform their task. Often this is done by annotating documents with information such as sentence boundaries, part-of-speech tags and parsing results, which can then be read by the information extraction tools. Examples of such NLP components are a part-of-speech tagger, a dependency parser, and a semantic role labeler.
  • Data mining (DM) is the process of identifying patterns in large sets of data in order to discover new knowledge.
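
The information retrieval step described above can be sketched as a small ranking routine: each document and the query become word-count vectors, and documents are ordered by cosine similarity to the query. This is only one simple way to match a query to documents, not the method of any specific IR system; the sample documents, the query, and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    q = Counter(tokens(query))
    scores = [cosine(q, Counter(tokens(d))) for d in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

# Hypothetical mini-collection and query.
docs = [
    "Patent applications for battery design",
    "Transcripts of phone calls with customers",
    "Research papers on battery chemistry",
]
ranking = rank_documents("battery patent", docs)
print(ranking)
```

Documents that share more words with the query rank higher; a document with no words in common scores zero and ends up last.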
Applications for Text Mining
At the simplest level, all words found in the input documents will be indexed and counted in order to compute a table of documents and words, i.e., a matrix of frequencies that enumerates the number of times that each word occurs in each document. This basic process can be further refined to exclude certain common words such as "the" and "a" (stop word lists) and to combine different grammatical forms of the same words such as "traveling," "traveled," "travel," etc. (stemming). 
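
The counting process just described can be sketched in a few lines: build a documents-by-words frequency matrix, skip a stop word list, and merge grammatical forms of the same word. The stop word list and the suffix-stripping rule in `crude_stem` are deliberately tiny stand-ins chosen for illustration; a real project would normally use a fuller stop word list and an established stemmer.

```python
import re
from collections import Counter

# Illustrative stop word list (an assumption, far from complete).
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is"}

def crude_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split into words, drop stop words, then stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Hypothetical two-document corpus.
docs = ["She traveled to the coast.", "Traveling is a joy and travel is cheap."]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Here "traveled", "traveling" and "travel" all collapse into the single column "travel", so the second document records two occurrences of it.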
·         Analyzing open-ended survey responses. 
·         Automatic processing of messages, emails, etc. 
·         Analyzing warranty or insurance claims, diagnostic interviews, etc.
·         Investigating competitors by crawling their web sites. 
·         Enterprise Business Intelligence/Data Mining, Competitive Intelligence
·         E-Discovery, Records Management
·         National Security/Intelligence
·         Scientific discovery, especially Life Sciences 
·         Search/Information Access
·         Social media monitoring

Why do text mining?
·         Enriching the Content.
·         Systematic Review of Literature.
·         Discovery.
·         Computational Linguistics Research.

