Text mining, which is sometimes referred to as “text analytics”, is one way to make qualitative or “unstructured” data usable by a computer.
Normally this text comes from a wide range of sources:
· An estimated 80% of data is unstructured.
· This includes emails, newspaper or web articles, internal reports, transcripts of phone calls, research papers, blog entries, and patent applications, to name a few.
The Oxford English Dictionary defines text mining as “the process or practice of examining large collections of written resources in order to generate new information, typically using specialized computer software. It is a subset of the larger field of data mining.”
Text mining refers to the process of parsing a selection or corpus of text in order to identify certain aspects, such as the most frequently occurring word or phrase.
The purpose of text mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and thus make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. We can analyze words or clusters of words used in documents, or we can analyze documents and determine similarities between them or how they relate to other variables of interest in the data mining project. In the most general terms, text mining will “turn text into numbers” (meaningful indices), which can then be incorporated into other analyses such as predictive data mining projects.
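In the simplest case, “turning text into numbers” can be as basic as counting how often each word appears in a corpus. The short Python sketch below illustrates this on two made-up documents; the sample texts are hypothetical and used only for illustration.

from collections import Counter
import re

# Two tiny, made-up documents standing in for a real corpus.
documents = [
    "Text mining turns unstructured text into numbers.",
    "Data mining finds patterns in large sets of numbers.",
]

word_counts = Counter()
for doc in documents:
    # Lowercase the text and keep only alphabetic tokens.
    tokens = re.findall(r"[a-z]+", doc.lower())
    word_counts.update(tokens)

# The most frequently occurring words across the corpus.
print(word_counts.most_common(5))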
How does it work?
Text mining involves the application of techniques from areas such as information retrieval, natural language processing, information extraction and data mining. These various stages of a text-mining process can be combined into a single workflow.
- Information retrieval (IR) systems match a user’s query to documents in a collection or database. The first step in the text mining process is to find the body of documents that are relevant to the research question(s).
- Natural language processing (NLP) analyzes the text according to the structures of human language. It allows the computer to perform a grammatical analysis of a sentence in order to “read” the text. For example, part-of-speech tagging classifies words into categories such as noun, verb or adjective, and word sense disambiguation identifies the meaning of a word, given its usage, from among the multiple meanings that the word may have (a minimal sketch of part-of-speech tagging follows this list). The role of NLP in text mining is to provide the systems in the information extraction phase with the linguistic data they need to perform their task. Often this is done by annotating documents with information such as sentence boundaries, part-of-speech tags and parsing results, which can then be read by the information extraction tools. Three kinds of NLP components are typically used: a part-of-speech tagger, a dependency parser and a semantic role labeler.
- Data mining (DM) is the process of identifying patterns in large sets of data in order to discover new knowledge.
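As a minimal sketch of the part-of-speech tagging step mentioned in the NLP item above, the snippet below uses the NLTK toolkit. The library choice and the example sentence are assumptions made for illustration; the text does not prescribe a particular tool.

import nltk

# Download the tokenizer and tagger models on first use
# (resource names can vary slightly between NLTK versions).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The system parses each sentence before extracting information."
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)          # label each token as noun, verb, adjective, etc.

print(tagged)
# e.g. [('The', 'DT'), ('system', 'NN'), ('parses', 'VBZ'), ...]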
Applications for Text Mining
At the simplest level, all words found in the input documents will be indexed and counted in order to compute a table of documents and words, i.e., a matrix of frequencies that enumerates the number of times that each word occurs in each document. This basic process can be further refined to exclude certain common words such as "the" and "a" (stop word lists) and to combine different grammatical forms of the same words such as "traveling," "traveled," "travel," etc. (stemming).
· Analyzing open-ended survey responses.
· Automatic processing of messages, emails, etc.
· Analyzing warranty or insurance claims, diagnostic interviews, etc.
· Investigating competitors by crawling their web sites.
· Enterprise Business Intelligence/Data Mining, Competitive Intelligence
· E-Discovery, Records Management
· National Security/Intelligence
· Scientific discovery, especially Life Sciences
· Search/Information Access
· Social media monitoring
Why do text mining?
· Enriching the Content.
· Systematic Review of Literature.
· Discovery.
· Computational Linguistics Research.