The corpora listed above are all spoken Cantonese, which is what I'm most interested in at the moment. Here are their full names and some more information:
Hong Kong Cantonese Adult Language Corpus (HKCAC): a 170,000-character corpus based on phone-in programs and forums aired on Hong Kong radio.
citation: Leung, M-T., and Law, S-P. (2002). HKCAC: The Hong Kong Cantonese Adult Language Corpus. International Journal of Corpus Linguistics 6.2:305-325.
Hong Kong University Cantonese Corpus (HKUCC) was collected from transcribed spontaneous speech in conversations and radio programs that involved two to four people. It was word-segmented, annotated with Cantonese pronunciation, and recently tagged with word classes using the part-of-speech (POS) scheme of Yu et al. (2002). The corpus contains about 29 hours of tape recordings and approximately 200,000 Chinese characters.
There is also a large Hong Kong newspaper corpus of (presumably standard Chinese formal) written material, but I haven't followed through on acquiring the corpus or its word/character list. Here is some information on this corpus to get you started:
The Segmentation Corpus is an electronic database of around 33,000 Cantonese word types extracted from a 1.7 million character corpus of Hong Kong newspapers, along with a tokenized record of the text. It is described in more detail in Chan & Tang (1999).
The wordlist proper is a file containing a separate entry for each word type identified by the segmentation criteria. Each entry has three fields: the orthographic form(s), the pronunciation(s) in Jyutping, and the token frequency in the segmented newspaper corpus.
citation: Chan S. D. and Tang Z. X. (1999) Quantitative Analysis of Lexical Distribution in Different Chinese Communities in the 1990’s. Yuyan Wenzi Yingyong [Applied Linguistics], No.3, 10-18.
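Since I haven't seen the actual wordlist file, here is only a rough sketch of how its three-field entries might be read once you have it. It assumes one entry per line with tab-separated fields, which is my guess and not something the corpus description states. The names `WordEntry` and `parse_wordlist` are made up for illustration, and the two sample lines (including their frequencies) are invented placeholders, not real corpus data.

```python
from typing import Iterable, List, NamedTuple


class WordEntry(NamedTuple):
    form: str       # orthographic form(s)
    jyutping: str   # pronunciation(s) in Jyutping
    frequency: int  # token frequency in the segmented newspaper corpus


def parse_wordlist(lines: Iterable[str]) -> List[WordEntry]:
    """Parse entries, ASSUMING one tab-separated entry per line.

    The real file layout may differ; adjust the separator as needed.
    """
    entries = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        form, jyutping, freq = fields
        entries.append(WordEntry(form, jyutping, int(freq)))
    return entries


# Invented placeholder lines; the frequencies are NOT real corpus counts.
sample = [
    "政府\tzing3 fu2\t95\n",
    "香港\thoeng1 gong2\t120\n",
]

entries = parse_wordlist(sample)

# Sort by token frequency, most frequent first.
by_frequency = sorted(entries, key=lambda e: e.frequency, reverse=True)
for entry in by_frequency:
    print(entry.form, entry.jyutping, entry.frequency)
```

Keeping each entry as a small named record makes it easy to sort by frequency or to look words up by their Jyutping later on.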
This also seems like an interesting book on non-standard written Cantonese:
Don Snow, Cantonese as Written Language: The Growth of a Written Chinese Vernacular.