Enhanced search for Arabic language using latent semantic indexing (LSI)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

The Vector Space Model (VSM) is a document representation model widely used in data mining and information retrieval (IR) systems. However, it poses challenges such as a high-dimensional feature space and the loss of semantic information. Latent semantic indexing (LSI) was therefore proposed to reduce the number of feature dimensions and to generate semantically rich features that represent conceptual term-document associations. LSI has been successfully applied in search engines and text classification tasks. In this paper, we propose a novel approach to enhance the quality of the documents retrieved by search engines for the Arabic language: rather than using standard LSI, we use a new extension of the technique. The proposed LSI method employs word co-occurrences to form the term-by-document matrix, and it evaluates documents using cosine similarity measures over that matrix. We will empirically evaluate its performance on an Arabic data collection containing at least 500 documents and at least 30,000 unique words. A test set of keywords from a specific domain will be used to evaluate the quality of the top 20-30 retrieved documents using different numbers of singular values (i.e., different numbers of dimensions). The results will be judged by comparing the performance of the proposed method with that of standard LSI.
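The pipeline the abstract describes (term-by-document matrix, truncated SVD, cosine-similarity ranking of the top retrieved documents) can be illustrated with a minimal sketch of standard LSI. This is not the authors' extended method: the abstract does not specify how word co-occurrences enter the matrix, so the sketch uses raw term counts. The toy corpus, the whitespace tokenizer, and all function names are assumptions made for illustration, and Arabic-specific preprocessing such as normalization and stemming is omitted.

```python
import numpy as np

# Hypothetical toy corpus: documents 0-2 are about books, 3-4 about football.
docs = [
    "كتاب مكتبة قراءة",
    "كتاب قراءة",
    "مكتبة كتاب",
    "كرة ملعب لاعب",
    "لاعب كرة",
]

def build_term_document_matrix(docs):
    """Return a (terms x documents) raw-count matrix and a term->row index."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1.0
    return A, index

def lsi_fit(A, k):
    """Truncated SVD: keep the k largest singular values (k = number of dimensions)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    doc_vecs = Vt[:k, :].T * s[:k]  # row j equals A[:, j] @ U_k
    return U_k, doc_vecs

def project_query(query, index, U_k):
    """Fold a keyword query into the same k-dimensional latent space."""
    q = np.zeros(U_k.shape[0])
    for w in query.split():
        if w in index:  # words outside the vocabulary are ignored
            q[index[w]] += 1.0
    return q @ U_k

def rank(query_vec, doc_vecs, top_n):
    """Return (document index, cosine similarity) pairs, best first."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = doc_vecs @ query_vec / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-sims)[:top_n]
    return [(int(j), float(sims[j])) for j in order]

A, index = build_term_document_matrix(docs)
U_k, doc_vecs = lsi_fit(A, k=2)
q_vec = project_query("قراءة", index, U_k)
results = rank(q_vec, doc_vecs, top_n=3)
for j, sim in results:
    print(j, round(sim, 3))
```

In this toy corpus, document 2 does not contain the query word yet still lands among the top three, because it shares latent structure with the documents that do. Varying `k` corresponds to the abstract's evaluation over different numbers of singular values.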

Original language: English
Title of host publication: 2018 International Conference on Intelligent and Innovative Computing Applications, ICONIC 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538664773
State: Published - 2 Jul 2018
Event: 2018 International Conference on Intelligent and Innovative Computing Applications, ICONIC 2018 - Plaine Magnien, Mauritius
Duration: 6 Dec 2018 to 7 Dec 2018

Publication series

Name: 2018 International Conference on Intelligent and Innovative Computing Applications, ICONIC 2018

Conference

Conference: 2018 International Conference on Intelligent and Innovative Computing Applications, ICONIC 2018
Country/Territory: Mauritius
City: Plaine Magnien
Period: 6/12/18 to 7/12/18

Keywords

  • Arabic text
  • Dimensionality reduction
  • Latent semantic indexing
  • Search engine

Funding Agency

  • Kuwait Foundation for the Advancement of Sciences
