Improving Neural Chinese Word Segmentation Using Unlabeled Data
Yanna Zhang, Jinan Xu, Guoyi Miao, Yufeng Chen, Yujie Zhang
School of Computer and Information Technology, Beijing Jiaotong University
Abstract: Supervised word segmentation relies heavily on large-scale, high-quality labeled data. However, building such a corpus is difficult, especially for domain-specific data. In this paper, we propose a novel semi-supervised Chinese word segmentation (CWS) method. Specifically, we extend the training data with sentences selected from a large pool of unlabeled sentences, using a sampling strategy based on character-based semantic similarity. The similarity algorithm scores each unlabeled sentence against the training data, and these scores are used to select helpful sample sentences. In addition, we integrate an attention mechanism into our word segmentation model so that it can focus on the available contextual information. Experiments on the PKU, MSR, and Weibo benchmark data sets show that our method outperforms previous neural network models and state-of-the-art methods.
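The abstract does not spell out the similarity algorithm, so the following is only a minimal sketch of one plausible reading of similarity-based sample selection, not the authors' method. It assumes each sentence is represented by the average of its character vectors, that the training data is summarized by the centroid of its sentence vectors, and that cosine similarity is the scoring function. The names `sentence_vector`, `cosine`, `select_samples`, `char_vectors`, and `top_k` are all hypothetical.

```python
import math


def sentence_vector(sentence, char_vectors, dim):
    """Average the vectors of the characters that have an embedding.

    Assumption: a sentence is represented by the mean of its character vectors.
    Characters missing from char_vectors are skipped.
    """
    total = [0.0] * dim
    count = 0
    for ch in sentence:
        vec = char_vectors.get(ch)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [value / count for value in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_samples(unlabeled, training, char_vectors, dim, top_k):
    """Return the top_k unlabeled sentences most similar to the training data.

    Assumption: the training data is summarized by the centroid of its
    sentence vectors, and each unlabeled sentence is scored against it.
    """
    train_vecs = [sentence_vector(s, char_vectors, dim) for s in training]
    centroid = [sum(col) / len(train_vecs) for col in zip(*train_vecs)]
    scored = [
        (cosine(sentence_vector(s, char_vectors, dim), centroid), s)
        for s in unlabeled
    ]
    # Highest similarity first; ties keep their original relative order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_k]]
```

Under these assumptions, calling `select_samples` with a list of unlabeled sentences, the labeled training sentences, and a dictionary of pretrained character vectors returns the candidates to add to the training data. How the selected sentences are then labeled is not described in the abstract and is left out of the sketch.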
Conference name:
2018 2nd International Conference on Artificial Intelligence Applications and Technologies (AIAAT2018)
Conference date:
2018-08-08
Conference location:
Shanghai, China
- Album:
Information Technology
- Topic:
Computer Software and Computer Applications
- Classification number:
TP391.1