The 'cnews.train.txt' data file is too large to upload directly, so it is distributed in compressed form and must be decompressed before it is imported.
We use SVM to implement simple text classification based on bag-of-words features and a support vector machine.
Import the data
```python
# imports
import codecs
import os
import jieba
```
Chinese news data is prepared as the sample data set. There are 50,000 training examples and 10,000 test examples, divided into 10 categories: sports, finance, real estate, home furnishing, education, technology, fashion, current affairs, games, and entertainment. The following code loads the training text so you can view the data format and samples:
```python
data_train = './data/cnews.train.txt'  # training data file name
data_test = './data/cnews.test.txt'    # test data file name
vocab = './data/cnews.vocab.txt'       # dictionary

with codecs.open(data_train, 'r', 'utf-8') as f:
    lines = f.readlines()
```
Take the first item of the training data as an example and segment the loaded news text. Here I use jieba's word segmentation function (LTP's segmenter is another option), and the segmentation results are displayed joined by the "/" symbol.
```python
# print word segmentation results for the first news item;
# each line is "label\tcontent", so take the part after the tab
content = lines[0].strip('\r\n').split('\t')[1]
segment = jieba.cut(content)
print('/'.join(segment))
```
Putting the above logic together, implement functions to load the training and test data and perform word segmentation.
```python
# cut data
def process_line(idx, line):
    data = tuple(line.strip('\r\n').split('\t'))
    if not len(data) == 2:
        return None
    content_segged = list(jieba.cut(data[1]))
    if idx % 1000 == 0:
        print('line number: {}'.format(idx))
    return (data[0], content_segged)

# data loading method
def load_data(file):
    with codecs.open(file, 'r', 'utf-8') as f:
        lines = f.readlines()
    data_records = [process_line(idx, line) for idx, line in enumerate(lines)]
    data_records = [data for data in data_records if data is not None]
    return data_records

# load and process training data
train_data = load_data(data_train)
print('first training data: label {} segment {}'.format(
    train_data[0][0], '/'.join(train_data[0][1])))
# load and process testing data
test_data = load_data(data_test)
print('first testing data: label {} segment {}'.format(
    test_data[0][0], '/'.join(test_data[0][1])))
```
Word segmentation takes some time; once it finishes, you can start building the dictionary. The dictionary is built from the training set and sorted by word frequency.
```python
def build_vocab(train_data, thresh):
    vocab = {'<UNK>': 0}
    word_count = {}  # word frequency
    for idx, data in enumerate(train_data):
        content = data[1]
        for word in content:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
    word_list = [(k, v) for k, v in word_count.items()]
    print('word list length: {}'.format(len(word_list)))
    word_list.sort(key=lambda x: x[1], reverse=True)  # sorted by word frequency
    word_list_filtered = [word for word in word_list if word[1] > thresh]
    print('word list length after filtering: {}'.format(len(word_list_filtered)))
    # construct vocab
    for word in word_list_filtered:
        vocab[word[0]] = len(vocab)
    # vocab size is filtered word list size + 1 due to the <UNK> token
    print('vocab size: {}'.format(len(vocab)))
    return vocab

vocab = build_vocab(train_data, 1)
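To see how the dictionary is used later, here is a toy illustration (the vocab contents below are made up, not from the real data): looking up ids for a segmented sentence, with out-of-vocabulary words falling back to `<UNK>` (id 0).

```python
# Toy vocab, standing in for the one built by build_vocab above
vocab = {'<UNK>': 0, 'sports': 1, 'match': 2}
segged = ['sports', 'match', 'news']
# map each word to its id, unknown words to <UNK>
ids = [vocab.get(word, vocab['<UNK>']) for word in segged]
print(ids)  # 'news' is out of vocabulary
```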
In addition, the categories themselves form a "dictionary" of labels:
```python
# build the label dictionary from a category file ("label\tid" per line)
def build_label_vocab(cate_file):
    label_vocab = {}
    with codecs.open(cate_file, 'r', 'utf-8') as f:
        for raw_line in f:
            line = raw_line.strip().split('\t')
            label_vocab[line[0]] = int(line[1])
    return label_vocab
```
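If no category file is at hand, a fallback sketch (my own, not from the post) is to build the label dictionary directly from the 10 category names listed in the dataset description above (the real cnews files store the Chinese names; the English translations are used here for illustration).

```python
# The 10 cnews categories as given in the dataset description
categories = ['sports', 'finance', 'real estate', 'home furnishing',
              'education', 'technology', 'fashion', 'current affairs',
              'games', 'entertainment']
# assign each category an integer id by position
label_vocab = {cate: idx for idx, cate in enumerate(categories)}
print(label_vocab['sports'])
```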
Next, construct the id-based training and test sets. Because we only consider bag of words, word order is discarded. The output is constructed in the format that libsvm can consume. Note that because the bag-of-words model produces very sparse features, the libsvm format records only the nonzero entries, as `index:value` pairs with indices in ascending order.
The remaining core model is simple: use libsvm to train the support vector machine. Feed it the training and test files you have prepared, then train with libsvm's existing routines, trying out different parameter settings. The documentation of libsvm can be viewed here; the "-s", "-t", and "-c" parameters matter most: they select the SVM type, the kernel function, and the penalty coefficient, respectively.
```python
from libsvm import svm
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict, svm_save_model, svm_load_model
```
After a period of training, we can observe the experimental results. You can try different SVM types, penalty coefficients, and kernel functions to optimize the results.