Text mining and data mining

Definitions and key issues

Weiss, Sholom M., Indurkhya, Nitin, Zhang, Tong, and Damerau, Fred J., Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer, New York, 2005:

Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they? In our view, a prediction problem can be solved by the same methods, whether the data are structured numerical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for predictive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modified to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.
Our view of text mining allows us to unify the concepts of different fields. No longer is "natural language processing" the sole domain of linguists and their allied computer specialists. No longer is search engine technology distinct from other forms of machine learning. Ours is an open view. We welcome you to try your hand at learning from data, whether numerical or text. You need not have a Ph.D. in linguistics to work in this area.
Not everyone will agree with our perspective. The natural language specialist may argue that ours is a shallow view of text that will solve some problems, but the bigger problems, such as answering questions posed by a user, can only be solved with a deeper understanding of language.
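
The transformation mentioned in the quoted passage, turning documents into measured values such as the presence or absence of words, can be illustrated with a minimal sketch. It is not taken from the book: the function names (tokenize, build_vocabulary, to_binary_vector), the simple regular-expression tokenizer, and the two sample sentences are all assumptions made for illustration only.

```python
import re


def tokenize(text):
    """Lowercase the text and extract runs of ASCII letters as word tokens.

    Assumption: a deliberately naive tokenizer, chosen only for illustration.
    """
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Return the sorted list of distinct words found across all documents."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def to_binary_vector(document, vocabulary):
    """Encode a document as 1/0 flags, one per vocabulary word (present or absent)."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical sample documents, invented for this example.
documents = [
    "Stocks rise as markets rally",
    "Markets fall as stocks drop",
]

vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    print(vector, doc)
```

Each document becomes a fixed-length numeric row, so in principle any predictive method that accepts numerical features could be applied to it. With a realistic collection the vocabulary would run to tens of thousands of words, which is the high-dimensionality issue the authors raise; a sparse representation would then likely be preferable to the dense lists used here.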
