This project will develop algorithms to automatically generate descriptive labels for large collections of web documents.
Postdoctoral fellow: Dr. Mathieu Sinn, David R. Cheriton School of Computer Science, University of Waterloo
Lead faculty member: Dr. Pascal Poupart, David R. Cheriton School of Computer Science, University of Waterloo
We will develop algorithms to automatically generate descriptive labels for large collections of web documents. Companies can use such labels to decide which web sites to place advertisements on, and electronic publishers can use them to categorize media offers. Currently, no approach can robustly and automatically label clusters of documents at a level of quality approaching that of human labellers. Since the main difficulties are capturing the underlying concepts of a group of documents and expressing them in a short, human-readable phrase, we will develop statistical topic models that leverage the online encyclopaedia Wikipedia to produce high-quality labels. Google has an immediate need for such an approach to improve its internal use of document clusters and may develop new commercial services that depend on the availability of a fully automated, high-quality labelling technique.