
Core Technologies

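Big Data Analytics

Big data analytics spans data analysis, data mining, data security, and related subfields, and is an important tool for enterprises seeking to strengthen their core competitiveness in the information age.

Building on speech and natural language processing technology, 有光科技 provides enterprise customers with in-depth, multi-domain big data analytics solutions, analyzing massive data sets across industries to uncover the business value behind the data.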

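Technical Features

  • AI-based mining of speech and text
  • A complete data processing pipeline that safeguards data security
  • Coverage of finance, government, telecommunications, retail, and other industries
  • Professional data production systems and annotation tools
  • Visual analytics support that lets the data speak for itself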

Application Areas

  • Intelligent IoT
  • Data annotation
  • User profile analysis
  • Sales opportunity mining
  • Smart city
  • Business intelligence

Research Papers

  • Performance Models of Access Latency in Cloud Storage Systems.

    Shuai, Q., Li, V.O.K., and Zhu, Y., Proc. Fourth Workshop on Architectures and Systems for Big Data, Minneapolis, MN, US, June 14, 2014.

    Access latency is a key performance metric for cloud storage systems and has great impact on user experience, but most papers focus on other performance metrics such as storage overhead, repair cost and so on. Only recently do some models argue that coding can reduce access latency. However, they are developed for special scenarios, which may not reflect reality. To fill the gaps between existing work and practice, in this paper, we propose a more practical model to measure access latency. This model can also be used to compare access latency of different codes used by different companies. To the best of our knowledge, this model is the first to provide a general method to compare access latencies of different erasure codes.


  • Granger-Causality-Based Air Quality Estimation with Spatio-Temporal (S-T) Heterogeneous Big Data

    Zhu, Y., Sun, C., and Li, V.O.K., Proc. IEEE INFOCOM Smart City Workshop, Hong Kong, China, April 2015.

    This paper considers city-wide air quality estimation with limited available monitoring stations which are geographically sparse. Since air pollution is highly spatio-temporal (S-T) dependent and considerably influenced by urban dynamics (e.g., meteorology and traffic), we can infer the air quality not covered by monitoring stations with S-T heterogeneous urban big data. However, estimating air quality using S-T heterogeneous big data poses two challenges. The first challenge is due to the data diversity, i.e., there are different categories of urban dynamics and some may be useless and even detrimental for the estimation. To overcome this, we first propose an S-T extended Granger causality model to analyze all the causalities among urban dynamics in a consistent manner. Then by implementing a non-causality test, we rule out the urban dynamics that do not “Granger” cause air pollution. The second challenge is due to the time complexity when processing the massive volume of data. We propose to discover the region of influence (ROI) by selecting data with the highest causality levels spatially and temporally. Results show that we achieve higher accuracy using “part” of the data than “all” of the data. This may be explained by the most influential data eliminating errors induced by redundant or noisy data. The causality model observation and the city-wide air quality map are illustrated and visualized using data from Shenzhen, China.
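
As a rough illustration of the non-causality test mentioned in the abstract above, the following Python sketch runs a textbook pairwise Granger F-test on synthetic data. It is not the paper's S-T extended model. The function names (`lagged_matrix`, `granger_f_stat`), the lag order, and the synthetic `traffic` and `pollution` series are all assumptions made for illustration.

```python
import numpy as np


def lagged_matrix(series, lags):
    """Stack columns series[t-1], ..., series[t-lags] for t = lags..n-1."""
    n = len(series)
    return np.column_stack([series[lags - k:n - k] for k in range(1, lags + 1)])


def granger_f_stat(cause, effect, lags=2):
    """F-statistic for H0: past values of `cause` do not help predict `effect`."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    target = effect[lags:]
    ones = np.ones((len(target), 1))
    # Restricted model: the effect is explained by its own past only.
    restricted = np.hstack([ones, lagged_matrix(effect, lags)])
    # Unrestricted model: additionally use the past of the candidate cause.
    unrestricted = np.hstack([restricted, lagged_matrix(cause, lags)])

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Synthetic example (assumed data): traffic drives pollution with a one-step lag.
rng = np.random.default_rng(0)
n = 500
traffic = rng.normal(size=n)
pollution = np.zeros(n)
for t in range(1, n):
    pollution[t] = 0.5 * pollution[t - 1] + 0.8 * traffic[t - 1] + 0.1 * rng.normal()

f_forward = granger_f_stat(traffic, pollution)  # traffic -> pollution
f_reverse = granger_f_stat(pollution, traffic)  # pollution -> traffic
print(f"traffic -> pollution F = {f_forward:.1f}")
print(f"pollution -> traffic F = {f_reverse:.1f}")
```

In a screening step of this kind, a candidate urban dynamic whose F-statistic falls below the chosen critical value would be ruled out as a non-cause before any further estimation.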


  • A Gaussian Bayesian Model to Identify Spatio-temporal Causalities for Air Pollution Based on Urban Big Data

    Zhu, J. Y., Zheng, Y., Yi, X., and Li, V.O.K., SmartCity16: The 2nd IEEE INFOCOM Workshop on Smart Cities and Urban Computing, San Francisco, California, USA, April 2016.

    Identifying the causalities for air pollutants and answering questions, such as, where do Beijing's air pollutants come from, are crucial to inform government decision-making. In this paper, we identify the spatio-temporal (ST) causalities among air pollutants at different locations by mining the urban big data. This is challenging for two reasons: 1) since air pollutants can be generated locally or dispersed from the neighborhood, we need to discover the causes in the ST space from many candidate locations with time efficiency; 2) the cause-and-effect relations between air pollutants are further affected by confounding variables like meteorology. To tackle these problems, we propose a coupled Gaussian Bayesian model with two components: 1) a Gaussian Bayesian Network (GBN) to represent the cause-and-effect relations among air pollutants, with an entropy-based algorithm to efficiently locate the causes in the ST space; 2) a coupled model that combines cause-and-effect relations with meteorology to better learn the parameters while eliminating the impact of confounding. The proposed model is verified using air quality and meteorological data from 52 cities over the period Jun 1st 2013 to May 1st 2015. Results show superiority of our model beyond baseline causality learning methods, in both time efficiency and prediction accuracy.


  • A Four-Layer Architecture for Online and Historical Big Data Analytics

    Zhu, J. Y., Xu, J., and Li, V.O.K., Proc. IEEE DataCom, Auckland, New Zealand, Aug 2016.

    Big data processing and analytics technologies have drawn much attention in recent years. However, the recent explosive growth of online data streams brings new challenges to the existing technologies. These online data streams tend to be massive, continuously arriving, heterogeneous, time-varying and unbounded. Therefore, it is necessary to have an integrated approach to process both big static data and online big data streams. We call this integrated approach online and historical big data analytics (OHBDA). We propose a four-layer architecture of OHBDA, consisting of the storage layer, the online and historical data processing layer, the analytics layer, and the decision-making layer. Functionalities and challenges of the four layers are further discussed. We conclude with a discussion of the requirements for future OHBDA solutions, which may serve as a foundation for future big data analytics research.
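
The four layers named in the abstract above can be pictured with a toy pipeline. The Python sketch below is only one possible reading of the layering, not the architecture from the paper: every class name, the in-memory list used as storage, the moving-window score, and the threshold rule are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Iterable, List


@dataclass
class StorageLayer:
    """Keeps historical (static) records; here just an in-memory list."""
    history: List[float] = field(default_factory=list)

    def append(self, value: float) -> None:
        self.history.append(value)


class ProcessingLayer:
    """Combines stored history with newly arriving stream items."""

    def __init__(self, storage: StorageLayer, window: int = 3):
        self.storage = storage
        self.window = window

    def ingest(self, value: float) -> List[float]:
        # Persist the new item, then return the most recent window.
        self.storage.append(value)
        return self.storage.history[-self.window:]


class AnalyticsLayer:
    """Compares the recent window against the long-run historical mean."""

    def score(self, recent: List[float], history: List[float]) -> float:
        return mean(recent) - mean(history)


class DecisionLayer:
    """Turns an analytics score into an action."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def decide(self, score: float) -> str:
        return "alert" if score > self.threshold else "ok"


def run_pipeline(historical: Iterable[float], stream: Iterable[float],
                 window: int = 3, threshold: float = 5.0) -> List[str]:
    """Feed stream items through the four layers, one decision per item."""
    storage = StorageLayer(list(historical))
    processing = ProcessingLayer(storage, window)
    analytics = AnalyticsLayer()
    decision = DecisionLayer(threshold)
    decisions = []
    for value in stream:
        recent = processing.ingest(value)
        decisions.append(decision.decide(analytics.score(recent, storage.history)))
    return decisions


print(run_pipeline(historical=[10, 10, 10, 10], stream=[10, 30, 30]))
```

Keeping each layer behind its own small interface means the storage backend or the analytics rule could be swapped without touching the other layers, which is the main point of a layered design.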


  • Pg-Causality: Identifying Spatiotemporal Causal Pathways for Air Pollutants with Urban Big Data

    Zhu, J.Y., Zhang, C., Zhi, S., Li, V.O.K., Han, J., Zheng, Y., arXiv:1610.07045, 2016.

    Many countries are suffering from severe air pollution. Understanding how different air pollutants accumulate and propagate is critical to making relevant public policies. In this paper, we use urban big data (air quality data and meteorological data) to identify the spatiotemporal (ST) causal pathways for air pollutants. This problem is challenging because: (1) there are numerous noisy and low-pollution periods in the raw air quality data, which may lead to unreliable causality analysis, (2) for large-scale data in the ST space, the computational complexity of constructing a causal structure is very high, and (3) the ST causal pathways are complex due to the interactions of multiple pollutants and the influence of environmental factors. Therefore, we present p-Causality, a novel pattern-aided causality analysis approach that combines the strengths of pattern mining and Bayesian learning to efficiently and faithfully identify the ST causal pathways. First, pattern mining helps suppress the noise by capturing frequent evolving patterns (FEPs) of each monitoring sensor, and greatly reduces the complexity by selecting the pattern-matched sensors as "causers". Then, Bayesian learning carefully encodes the local and ST causal relations with a Gaussian Bayesian network (GBN)-based graphical model, which also integrates environmental influences to minimize biases in the final results. We evaluate our approach with three real-world data sets containing 982 air quality sensors, in three regions of China from 01-Jun-2013 to 19-Dec-2015. Results show that our approach outperforms the traditional causal structure learning methods in time efficiency, inference accuracy and interpretability.


  • An Extended Spatio-temporal Granger Causality Model for Air Quality Estimation with Heterogeneous

    Zhu, J.Y., Sun, C., and Li, V.O.K., IEEE Transactions on Big Data, to appear.

    This paper deals with city-wide air quality estimation with limited air quality monitoring stations which are geographically sparse. Since air pollution is influenced by urban dynamics (e.g., meteorology and traffic) which are available throughout the city, we can infer the air quality in regions without monitoring stations based on such spatial-temporal (ST) heterogeneous urban big data. However, big data-enabled estimation poses three challenges. The first challenge is data diversity, i.e., there are many different categories of urban data, some of which may be useless for the estimation. To overcome this, we extend Granger causality to the ST space to analyze all the causality relations in a consistent manner. The second challenge is the computational complexity due to processing the massive volume of data. To overcome this, we introduce the non-causality test to rule out urban dynamics that do not “Granger” cause air pollution, and the region of influence (ROI), which enables us to only analyze data with the highest causality levels. The third challenge is to adapt our grid-based algorithm to non-grid-based applications. By developing a flexible grid-based estimation algorithm, we can decrease the inaccuracies due to the grid-based algorithm while maintaining computation efficiency.


  • Intelligent Fault Detection Scheme for Microgrids with Wavelet-based Deep Neural Networks

    James J.Q. Yu, Yunhe Hou, Albert Y.S. Lam, and Victor O.K. Li, to appear in IEEE Transactions on Smart Grid, 2017.

    Fault detection is essential in microgrid control and operation, as it enables the system to perform fast fault isolation and recovery. The adoption of inverter-interfaced distributed generation in microgrids makes traditional fault detection schemes inappropriate due to their dependence on significant fault currents. In this paper, we devise an intelligent fault detection scheme for microgrids based on wavelet transform and deep neural networks. The proposed scheme aims to provide fast fault type, phase, and location information for microgrid protection and service recovery. In the scheme, branch current measurements sampled by protective relays are pre-processed by discrete wavelet transform to extract statistical features. Then all available data is input into deep neural networks to develop fault information. Compared with previous work, the proposed scheme can provide significantly better fault type classification accuracy. Moreover, the scheme can also detect the locations of faults, which is unavailable in previous work. To evaluate the performance of the proposed fault detection scheme, we conduct a comprehensive evaluation study on the CERTS microgrid and IEEE 34-bus system. The simulation results demonstrate the efficacy of the proposed scheme in terms of detection accuracy, computation time, and robustness against measurement uncertainty.
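
The pre-processing step described in the abstract above (discrete wavelet transform followed by statistical features) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the abstract does not name the wavelet family, the decomposition depth, or the exact statistics, so the Haar wavelet, three levels, and the mean/standard deviation/energy features are all choices made here, and `haar_dwt` and `wavelet_features` are hypothetical helper names. The synthetic step change only stands in for a fault.

```python
import numpy as np


def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform (even-length input)."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:
        raise ValueError("signal length must be even")
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)  # low-pass band
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)  # high-pass band
    return approx, detail


def wavelet_features(signal, levels=3):
    """Mean, standard deviation, and energy of the detail band at each level."""
    features = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        features += [detail.mean(), detail.std(), float(np.sum(detail ** 2))]
    return np.array(features)


# Assumed test signal: a sinusoidal "current" with and without an abrupt step.
t = np.arange(64)
normal = np.sin(2 * np.pi * t / 32)
faulted = normal.copy()
faulted[41:] += 3.0  # hypothetical disturbance, not real relay data

print("normal :", np.round(wavelet_features(normal), 3))
print("faulted:", np.round(wavelet_features(faulted), 3))
```

The resulting fixed-length feature vectors are the kind of input a downstream classifier, such as a neural network, could be trained on.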


  • Delay Aware Intelligent Transient Stability Assessment System

    James J.Q. Yu, Albert Y.S. Lam, David J. Hill, and Victor O.K. Li. IEEE Access, vol. 5, pp. 17230–17239, Dec. 2017.

    Transient stability assessment is a critical tool for power system design and operation. With the emerging advanced synchrophasor measurement techniques, machine learning methods are playing an increasingly important role in power system stability assessment. However, most existing research makes a strong assumption that the measurement data transmission delay is negligible. In this paper, we focus on investigating the influence of communication delay on synchrophasor-based transient stability assessment. In particular, we develop a delay aware intelligent system to address this issue. By utilizing an ensemble of multiple long short-term memory networks, the proposed system can make early assessments to achieve a much shorter response time by utilizing incomplete system variable measurements. Compared with existing work, our system is able to make accurate assessments with a significantly improved efficiency. We perform numerous case studies to demonstrate the superiority of the proposed intelligent system, in which accurate assessments can be developed with time one third less than state-of-the-art methodologies. Moreover, the simulations indicate that noise in the measurements has trivial impact on the assessment performance, demonstrating the robustness of the proposed system.


  • Intelligent Time-Adaptive Transient Stability Assessment System

    James J.Q. Yu, David J. Hill, Albert Y.S. Lam, Jiatao Gu, and Victor O.K. Li. IEEE Transactions on Power Systems, vol. 33, no. 1, pp. 1049–1058, Jan. 2018.

    Online identification of postcontingency transient stability is essential in power system control, as it facilitates the grid operator to decide and coordinate system failure correction control actions. Utilizing machine learning methods with synchrophasor measurements for transient stability assessment has received much attention recently with the gradual deployment of wide-area protection and control systems. In this paper, we develop a transient stability assessment system based on the long short-term memory network. By proposing a temporal self-adaptive scheme, our proposed system aims to balance the trade-off between assessment accuracy and response time, both of which may be crucial in real-world scenarios. Compared with previous work, the most significant enhancement is that our system learns from the temporal data dependencies of the input data, which contributes to better assessment accuracy. In addition, the model structure of our system is relatively less complex, speeding up the model training process. Case studies on three power systems demonstrate the efficacy of the proposed transient stability assessment system.


  • Delay Aware Power System Synchrophasor Recovery and Prediction Framework

    James J.Q. Yu, Albert Y.S. Lam, David J. Hill, Yunhe Hou, and Victor O.K. Li. IEEE Transactions on Smart Grid, 2018.

    This paper presents a novel delay aware synchrophasor recovery and prediction framework to address the problem of missing power system state variables due to the existence of communication latency. This capability is particularly essential for dynamic power system scenarios where fast remedial control actions are required due to system events or faults. While a wide area measurement system can sample high-frequency system states with phasor measurement units, the control center cannot obtain them in real-time due to latency and data loss. In this work, a synchrophasor recovery and prediction framework and its practical implementation are proposed to recover the current system state and predict the future states utilizing existing incomplete synchrophasor data. The framework establishes an iterative prediction scheme, and the proposed implementation adopts recent machine learning advances in data processing. Simulation results indicate the superior accuracy and speed of the proposed framework, and investigations are made to study its sensitivity to various communication delay patterns for pragmatic applications.


  • Delay Aware Transient Stability Assessment with Synchrophasor Recovery and Prediction Framework

    James J.Q. Yu, David J. Hill, and Albert Y.S. Lam. Neurocomputing, 2018.

    Transient stability assessment is critical for power system operation and control. Existing related research makes a strong assumption that the data transmission time for system variable measurements to arrive at the control center is negligible, which is unrealistic. In this paper, we focus on investigating the impact of data transmission latency on synchrophasor-based transient stability assessment. In particular, we employ a recently proposed methodology named synchrophasor recovery and prediction framework to handle the latency issue and make up missing synchrophasors. Advanced deep learning techniques are adopted to utilize the processed data for assessment. Compared with existing work, our proposed mechanism can make accurate assessments with a significantly faster response speed.


  • Travel Demand Prediction using Deep Multi-Scale Convolutional LSTM Network

    Kai Fung Chu, Albert Y.S. Lam, and Victor O.K. Li. 21st IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2018), Maui, HI, Nov. 2018.

    Mobility on Demand transforms the way people travel in the city and facilitates real-time vehicle hiring services. Given the predicted future travel demand, service providers can coordinate their available vehicles such that they are pre-allocated to the customers’ origins of service in advance to reduce waiting time. Traditional approaches on future travel demand prediction rely on statistical or machine learning methods. Advancement in sensor technology generates a huge amount of data, which enables the data-driven intelligent transportation system. In this paper, inspired by deep learning techniques for image and video processing, we propose a new deep learning model, called Multi-Scale Convolutional Long Short-Term Memory (MultiConvLSTM), by considering travel demand as image pixel values. MultiConvLSTM considers both temporal and spatial correlations to predict the future travel demand. Experiments on real-world New York taxi data with around 400 million records are performed. We show that MultiConvLSTM outperforms the existing prediction methods for travel demand prediction and achieves the highest accuracy among all in both one-step and multiple-step predictions.
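
The idea of treating travel demand as image pixel values, mentioned in the abstract above, can be made concrete with a small Python sketch that bins pickups into a grid, one frame per time slot. The record layout (`slot`, `lon`, `lat`), the grid size, the bounding box, and the helper names `demand_image` and `demand_frames` are assumptions for illustration; the abstract does not specify how the paper rasterizes its data.

```python
import numpy as np


def demand_image(lons, lats, bounds, grid=(4, 4)):
    """Count pickups per grid cell so that each cell acts like a pixel."""
    lon_min, lon_max, lat_min, lat_max = bounds
    counts, _, _ = np.histogram2d(
        lats, lons, bins=grid, range=[[lat_min, lat_max], [lon_min, lon_max]]
    )
    return counts  # rows index latitude bins, columns index longitude bins


def demand_frames(trips, bounds, n_slots, grid=(4, 4)):
    """Stack one demand image per time slot: shape (n_slots, rows, cols)."""
    frames = np.zeros((n_slots, *grid))
    for slot in range(n_slots):
        in_slot = trips[:, 0] == slot
        frames[slot] = demand_image(trips[in_slot, 1], trips[in_slot, 2], bounds, grid)
    return frames


# Toy records (assumed layout): [time slot, longitude, latitude] in a unit square.
trips = np.array([
    [0, 0.1, 0.1],
    [0, 0.1, 0.1],
    [0, 0.9, 0.9],
    [1, 0.6, 0.2],
])
frames = demand_frames(trips, bounds=(0.0, 1.0, 0.0, 1.0), n_slots=2, grid=(2, 2))
print(frames)
```

A sequence of such frames has the same shape as a short grayscale video, which is what makes convolutional and recurrent video models applicable to demand prediction.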