IR-472: (2006) Bekkerman, R., Eguchi, K. and Allan, J., "Unsupervised Non-topical Classification of Documents," CIIR Technical Report.

We describe the problem of non-topical clustering of documents, which aims to divide a set of documents into clusters whose members share some aspect other than topic. We present experiments on the British National Corpus that cluster documents by genre. We show that words are superior to part-of-speech information for genre clustering, but that better results can be obtained by using both. We also demonstrate that the new multi-way distributional clustering approach is highly effective for this task because it requires less feature crafting than other techniques.