
Technologies

Core Technologies

Natural Language Processing

Natural language processing is one of the most cutting-edge and important topics in artificial intelligence today. It enables machines to understand natural language and, in turn, to communicate with humans without barriers.

有光科技 (Fano Labs) uses machine learning algorithms to deliver natural language processing for multiple and mixed languages, including English, Mandarin, and Cantonese, and has applied it widely in automatic question answering, text analytics, knowledge graphs, sentiment analysis, and related areas.

Technical Features

  • Supports processing of multiple and mixed languages
  • Supports intent recognition, enabling machines to understand human language
  • Understands synonyms, paraphrases, euphemisms, and other expressions
  • Uses text analytics to build industry knowledge graphs
  • Applies machine learning to support real-time model training

Application Areas

  • Public opinion analysis
  • Text chatbots
  • Text understanding and analysis
  • Emotion recognition
  • Knowledge graph construction
  • User profiling

Research Papers

  • Efficient Learning for Undirected Topic Models

    Gu, J. and Li, V.O.K., Proc. ACL-IJCNLP, Beijing, China, July 2015.

    Replicated Softmax model, a well-known undirected topic model, is powerful in extracting semantic representations of documents. Traditional learning strategies such as Contrastive Divergence are very inefficient. This paper provides a novel estimator to speed up the learning based on Noise Contrastive Estimate, extended to documents of varying lengths and weighted inputs. Experiments on two benchmarks show that the new estimator achieves great learning efficiency and high accuracy on document retrieval and classification.
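
    As a rough illustration of the Noise Contrastive Estimate idea the paper builds on, the sketch below (our code, not the paper's; all names are illustrative) recasts density estimation as logistic classification between data words and samples from a known noise distribution:

      import numpy as np

      def nce_loss(model_logp, noise_logp, is_data, k):
          """model_logp: unnormalized model log-scores per word.
          noise_logp: log-probability of each word under the noise distribution.
          is_data: 1 for true data words, 0 for the k noise samples per data word."""
          # Posterior that a word came from the data rather than the noise.
          logit = model_logp - (noise_logp + np.log(k))
          p_data = 1.0 / (1.0 + np.exp(-logit))
          # Binary cross-entropy over data-vs-noise labels.
          return -np.mean(is_data * np.log(p_data) + (1 - is_data) * np.log(1 - p_data))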


  • Learning to Translate in Real-time with Neural Machine Translation

    Gu, J., Neubig, G., Cho, K., and Li, V.O.K., arXiv:1610.00388, 2016.

    Translating in real-time, a.k.a. simultaneous translation, outputs translation words before the input sentence ends, which is a challenging problem for conventional machine translation methods. We propose a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to make decisions on when to translate from the interaction with a pre-trained NMT environment. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam-search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the proposed framework both quantitatively and qualitatively.
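
    The decision loop described above can be pictured with the following skeleton, where `policy` (the agent) and `nmt_env` (the pre-trained NMT environment) are placeholder interfaces we introduce for illustration, not the authors' code:

      READ, WRITE = 0, 1

      def simultaneous_translate(source_words, policy, nmt_env, max_len=100):
          # The agent interleaves READ (consume a source word) and WRITE
          # (emit a target word before the input sentence has ended).
          read_idx, output = 0, []
          state = nmt_env.init_state()
          while len(output) < max_len:
              action = policy(state)
              if action == READ and read_idx < len(source_words):
                  state = nmt_env.read(state, source_words[read_idx])
                  read_idx += 1
              else:
                  word, state = nmt_env.write(state)
                  if word == "</s>":      # end-of-sentence token
                      break
                  output.append(word)
          return output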


  • Incorporating Copying Mechanism in Sequence-to-Sequence Learning

    Gu, J., Lu, Z., Li, H., and Li, V.O.K., Proc. Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany, Aug 2016.

    We address an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments in the input sequence are selectively replicated in the output sequence. A similar phenomenon is observable in human language communication. For example, humans tend to repeat entity names or even long phrases in conversation. The challenge with regard to copying in Seq2Seq is that new machinery is needed to decide when to perform the operation. In this paper, we incorporate copying into neural network-based Seq2Seq learning and propose a new model called CopyNet with encoder-decoder structure. CopyNet can nicely integrate the regular way of word generation in the decoder with the new copying mechanism, which can choose sub-sequences in the input sequence and put them at proper places in the output sequence. Our empirical study on both synthetic and real-world data sets demonstrates the efficacy of CopyNet. For example, CopyNet can outperform the regular RNN-based model by a remarkable margin on text summarization tasks.
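
    A minimal numpy sketch of the core mixture (our illustration, with invented shapes): the next-word distribution combines a generate-mode softmax over the vocabulary with a copy-mode softmax over source positions, whose mass is scattered back onto the shared vocabulary:

      import numpy as np

      def copynet_step(gen_scores, copy_scores, src_ids, vocab_size):
          # One joint softmax over both modes, as in the CopyNet idea.
          scores = np.concatenate([gen_scores, copy_scores])
          e = np.exp(scores - scores.max())
          probs = e / e.sum()
          p_vocab = probs[:vocab_size].copy()
          for pos, tok in enumerate(src_ids):
              p_vocab[tok] += probs[vocab_size + pos]  # mass for copying source word `tok`
          return p_vocab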


  • Trainable Greedy Decoding for Neural Machine Translation

    Gu, J., Cho, K., and Li, V.O.K., arXiv:1702.02429, 2017.

    Recent research in neural machine translation has largely focused on two aspects: neural network architectures and end-to-end learning algorithms. The problem of decoding, however, has received relatively little attention from the research community. In this paper, we solely focus on the problem of decoding given a trained neural machine translation model. Instead of trying to build a new decoding algorithm for any specific decoding objective, we propose the idea of a trainable decoding algorithm in which we train a decoding algorithm to find a translation that maximizes an arbitrary decoding objective. More specifically, we design an actor that observes and manipulates the hidden state of the neural machine translation decoder and propose to train it using a variant of deterministic policy gradient. We extensively evaluate the proposed algorithm using four language pairs and two decoding objectives and show that we can indeed train a trainable greedy decoder that generates a better translation (in terms of a target decoding objective) with minimal computational overhead.
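
    Schematically (our sketch, with illustrative sizes, not the paper's code), the actor is a small deterministic network that edits the decoder's hidden state before the greedy argmax:

      import numpy as np

      rng = np.random.default_rng(0)
      d_hidden, vocab = 512, 1000                    # illustrative sizes
      W_actor = rng.normal(scale=0.01, size=(d_hidden, d_hidden))
      W_out = rng.normal(scale=0.01, size=(d_hidden, vocab))

      def actor(h):
          # Deterministic edit of the hidden state; trained in the paper
          # with a variant of deterministic policy gradient.
          return h + np.tanh(h @ W_actor)

      def greedy_step(h):
          # Greedy word choice under the manipulated state.
          return int(np.argmax(actor(h) @ W_out))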


  • A Teacher-Student Framework for Zero-Resource Neural Machine Translation

    Chen, Y., Liu, Y., Cheng, Y., and Li, V.O.K., arXiv:1705.00753, 2017.

    While end-to-end neural machine translation (NMT) has made remarkable progress recently, it still suffers from the data scarcity problem for low-resource language pairs and domains. In this paper, we propose a method for zero-resource NMT by assuming that parallel sentences have close probabilities of generating a sentence in a third language. Based on this assumption, our method is able to train a source-to-target NMT model ("student") without parallel corpora available, guided by an existing pivot-to-target NMT model ("teacher") on a source-pivot parallel corpus. Experimental results show that the proposed method significantly improves over a baseline pivot-based model by +3.0 BLEU points across various language pairs.
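
    The word-level guidance can be sketched as a cross-entropy between the two models' next-word distributions (our illustration of the idea, not the authors' code):

      import numpy as np

      def guidance_loss(student_probs, teacher_probs, eps=1e-12):
          """Both arrays have shape (target_len, vocab): next-word
          distributions of the source-to-target student and the
          pivot-to-target teacher on a source-pivot sentence pair."""
          return -np.mean(np.sum(teacher_probs * np.log(student_probs + eps), axis=1))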


  • Search Engine Guided Non-Parametric Neural Machine Translation

    Gu, J., Wang, Y., Cho, K., and Li, V.O.K., arXiv:1705.07267, May 2017.

    In this paper, we extend an attention-based neural machine translation (NMT) model by allowing it to access the entire training set of parallel sentence pairs even after training. The proposed approach consists of two stages. In the first stage (the retrieval stage), an off-the-shelf, black-box search engine is used to retrieve a small subset of sentence pairs from the training set given a source sentence. These pairs are further filtered with a fuzzy matching score based on edit distance. In the second stage (the translation stage), a novel translation model, called translation memory enhanced NMT (TM-NMT), seamlessly uses both the source sentence and the set of retrieved sentence pairs to perform the translation. Empirical evaluation on three language pairs (En-Fr, En-De, and En-Es) shows that the proposed approach significantly outperforms the baseline, and the improvement is more significant when more relevant sentence pairs are retrieved.
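
    The retrieval-stage filter can be sketched as follows (our illustration; the similarity normalization and threshold are assumptions, not the paper's exact formula):

      def edit_distance(a, b):
          # Standard Levenshtein distance over token lists.
          m, n = len(a), len(b)
          d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
          for i in range(1, m + 1):
              for j in range(1, n + 1):
                  cost = 0 if a[i - 1] == b[j - 1] else 1
                  d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
          return d[m][n]

      def fuzzy_score(a, b):
          return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

      def filter_pairs(source, candidates, threshold=0.5):
          # Keep only retrieved pairs whose source side is close to the query.
          return [(s, t) for s, t in candidates
                  if fuzzy_score(source.split(), s.split()) >= threshold]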


  • Universal Neural Machine Translation for Extremely Low Resource Languages

    Gu, J., Hassan, H., Devlin, J., and Li, V.O.K., Proc. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018.

    In this paper, we propose a new universal machine translation approach focusing on languages with a limited amount of parallel data. Our proposed approach utilizes a transfer-learning approach to share lexical and sentence-level representations across multiple source languages into one target language. The lexical part is shared through a Universal Lexical Representation to support multilingual word-level sharing. The sentence-level sharing is represented by a model of experts from all source languages that share the source encoders with all other languages. This enables the low-resource language to utilize the lexical and sentence representations of the higher-resource languages. Our approach is able to achieve 23 BLEU on Romanian-English WMT2016 using a tiny parallel corpus of 6k sentences, compared to the 18 BLEU of a strong baseline system that uses multilingual training and back-translation. Furthermore, we show that the proposed approach can achieve almost 20 BLEU on the same dataset through fine-tuning a pre-trained multilingual system in a zero-shot setting.
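
    The Universal Lexical Representation can be pictured as attention over a shared embedding table (our sketch with invented dimensions, not the paper's code):

      import numpy as np

      def universal_embedding(query_vec, universal_keys, universal_vals, tau=0.05):
          # query_vec: language-specific query for one word, shape (d,)
          # universal_keys/vals: tables shared by all source languages, shape (M, d)
          scores = universal_keys @ query_vec / tau
          weights = np.exp(scores - scores.max())
          weights /= weights.sum()
          return weights @ universal_vals    # mixture of shared embeddings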


  • Non-Autoregressive Neural Machine Translation

    Gu, J., Bradbury, J., Xiong, C., Li, V.O.K., and Socher, R., Proc. International Conference on Learning Representations (ICLR), 2018.

    Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative to the autoregressive Transformer network used as a teacher. We demonstrate substantial cumulative improvements associated with each of the three aspects of our training strategy, and validate our approach on IWSLT 2016 English-German and two WMT language pairs. By sampling fertilities in parallel at inference time, our non-autoregressive model achieves near-state-of-the-art performance of 29.8 BLEU on WMT 2016 English-Romanian.
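
    The fertility step can be illustrated in a few lines (our sketch): each source token is copied according to its predicted fertility to form the decoder input, after which all target words are predicted in parallel:

      def expand_by_fertility(source_tokens, fertilities):
          # fertilities[i] = number of target slots source token i accounts for.
          decoder_input = []
          for tok, f in zip(source_tokens, fertilities):
              decoder_input.extend([tok] * f)
          return decoder_input

      # e.g. expand_by_fertility(["ein", "Haus"], [1, 2]) -> ["ein", "Haus", "Haus"]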


  • Neural Machine Translation with Gumbel-Greedy Decoding

    Gu, J., Im, D.J., and Li, V.O.K., Proc. AAAI Conference on Artificial Intelligence (AAAI), 2018.

    Previous neural machine translation models used some heuristic search algorithms (e.g., beam search) in order to avoid solving the maximum a posteriori problem over translation sentences at test time. In this paper, we propose the Gumbel-Greedy Decoding which trains a generative network to predict translation under a trained model. We solve such a problem using the Gumbel-Softmax reparameterization, which makes our generative network differentiable and trainable through standard stochastic gradient methods. We empirically demonstrate that our proposed model is effective for generating sequences of discrete words.
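
    The Gumbel-Softmax reparameterization at the heart of the method can be sketched as follows (a generic formulation, not the paper's code):

      import numpy as np

      def gumbel_softmax(logits, tau=1.0, rng=np.random.default_rng()):
          # Adding Gumbel noise and taking a temperature-controlled softmax
          # gives a differentiable approximation to sampling one word.
          u = rng.uniform(1e-9, 1.0, size=logits.shape)
          gumbel = -np.log(-np.log(u))
          y = (logits + gumbel) / tau
          e = np.exp(y - y.max())
          return e / e.sum()   # soft one-hot; argmax recovers a hard sample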


  • Travel Demand Prediction using Deep Multi-Scale Convolutional LSTM Network

    Chu, K.F., Lam, A.Y.S., and Li, V.O.K., Proc. 21st IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2018), Maui, HI, Nov. 2018.

    Mobility on Demand transforms the way people travel in the city and facilitates real-time vehicle hiring services. Given the predicted future travel demand, service providers can coordinate their available vehicles such that they are pre-allocated to the customers' origins of service in advance to reduce waiting time. Traditional approaches to future travel demand prediction rely on statistical or machine learning methods. Advances in sensor technology generate huge amounts of data, which enable data-driven intelligent transportation systems. In this paper, inspired by deep learning techniques for image and video processing, we propose a new deep learning model, called Multi-Scale Convolutional Long Short-Term Memory (MultiConvLSTM), by treating travel demand as image pixel values. MultiConvLSTM considers both temporal and spatial correlations to predict the future travel demand. Experiments on real-world New York taxi data with around 400 million records are performed. We show that MultiConvLSTM outperforms existing methods for travel demand prediction and achieves the highest accuracy in both one-step and multi-step prediction.
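
    The input representation can be sketched as follows (our illustration; grid size and time step are assumptions): trip records are binned into a (time, rows, cols) tensor so that each time step is a demand "image" whose pixel values count pickups per grid cell:

      import numpy as np

      def demand_tensor(records, lat0, lon0, cell, grid=(16, 16), n_steps=48):
          """records: iterable of (t_step, lat, lon) pickups with t_step < n_steps."""
          demand = np.zeros((n_steps,) + grid)
          for t, lat, lon in records:
              r = int((lat - lat0) / cell)
              c = int((lon - lon0) / cell)
              if 0 <= r < grid[0] and 0 <= c < grid[1]:
                  demand[t, r, c] += 1
          return demand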