Ubuntu Chat Corpus

What is the Ubuntu Chat Corpus?

The Ubuntu Chat Corpus (UCC) is composed of archived chat logs from Ubuntu's Internet Relay Chat (IRC) technical support channels. Ubuntu uses IRC as one of many modes of technical support, one that offers real-time problem solving. We have taken some of the archived messages (which are in the public domain), reorganized the file structure, removed some unnecessary system messages, and compressed the result to make the corpus easier to obtain. More details can be found in our AAAI 2013 Spring Symposium paper: Uthus, D.C., & Aha, D.W. (2013) The Ubuntu Chat Corpus for Multiparticipant Chat Analysis.


Full corpus: Version 2012-12-11
Preview (one day's worth of messages from the primary support channel): Preview
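
For readers who want to work with the logs programmatically, the following is a minimal Python sketch of the kind of filtering described above: separating chat messages from system messages. The line shapes it expects (`[HH:MM] <nick> text` for chat and `=== ...` for system notices) are assumptions made for illustration and are not taken from the corpus description; the name `parse_chat_lines` is hypothetical. Check the files in the preview before relying on it.

```python
import re

# Assumed line shapes (hypothetical, not confirmed by the corpus description):
#   chat message:   "[12:01] <alice> how do I mount a USB drive?"
#   system message: "=== bob has joined #ubuntu"
CHAT_LINE = re.compile(r"^\[(\d{2}:\d{2})\] <([^>]+)> (.*)$")


def parse_chat_lines(lines):
    """Return (time, nick, text) tuples, skipping lines that are not chat messages."""
    messages = []
    for line in lines:
        match = CHAT_LINE.match(line.rstrip("\n"))
        if match:
            messages.append(match.groups())
    return messages


# Invented sample lines, used only to exercise the parser.
sample = [
    "=== bob has joined #ubuntu",
    "[12:01] <alice> how do I mount a USB drive?",
    "[12:02] <bob> alice: try the Disks utility",
    "=== alice has left #ubuntu",
]

for time, nick, text in parse_chat_lines(sample):
    print(time, nick, text)
```

Because the corpus already has some system messages removed, a filter like this may only be needed for whatever residual non-chat lines remain.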

Labeled Corpora


Corpus of chat messages composed of two sets: a set of unlabeled messages for training and a set of messages, labeled for relevance to "Unity", for testing. This labeled corpus is described in more detail in our FLAIRS 2013 paper: Uthus, D.C., & Aha, D.W. (2013) Extending Word Highlighting in Multiparticipant Chat.
[Unity Corpus]


Corpus of chat messages labeled as either human-answerable questions (HAQ) or bot-answerable questions (BAQ), with the latter meaning the question can be answered with a factoid. Also included is a file that groups factoid commands (a single factoid can often be accessed using several different commands). This labeled corpus is described in more detail in our soon-to-be-published IJCNLP 2013 paper: Uthus, D.C., & Aha, D.W. (2013) Detecting Bot-Answerable Questions in Ubuntu Chat.
[BAQ Corpus]
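
The grouping of factoid commands can be thought of as an alias table: several commands resolve to one factoid. The Python sketch below illustrates that idea only. The command names, the `FACTOID_GROUPS` layout, and the helper names are all hypothetical placeholders; the actual format of the grouping file in the BAQ corpus is not described here and may differ.

```python
# Hypothetical grouping: each factoid maps to the commands that retrieve it.
# These names are placeholders, not entries from the corpus file.
FACTOID_GROUPS = {
    "paste": ["!paste", "!pastebin"],
    "ask": ["!ask", "!question"],
}


def build_alias_index(groups):
    """Invert the grouping so each command points at its factoid."""
    index = {}
    for factoid, commands in groups.items():
        for command in commands:
            index[command.lower()] = factoid
    return index


def canonical_factoid(command, index):
    """Return the factoid a command belongs to, or None if it is unknown."""
    return index.get(command.lower())


index = build_alias_index(FACTOID_GROUPS)
print(canonical_factoid("!pastebin", index))
print(canonical_factoid("!unknown", index))
```

Collapsing aliases this way lets two questions answered by different commands be treated as having the same answer, which may be useful when counting or comparing BAQ labels.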
Updated: 24th August, 2013. Copyright © 2013 David Uthus