TY - GEN
T1 - Enhanced search for Arabic language using latent semantic indexing (LSI)
AU - Al-Anzi, Fawaz S.
AU - AbuZeina, Dia
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - The Vector Space Model (VSM) is a common document representation model that is widely used in data mining and information retrieval (IR) systems. However, this technique poses some challenges such as high dimensional space and semantic loss representation. Therefore, the latent semantic indexing (LSI) is proposed to reduce the feature dimensions and to generate semantic rich features that represent conceptual term-document associations. In particular, LSI has been successfully implemented in search engines and text classification tasks. In this paper, we propose a novel approach to enhance the quality of the retrieved documents in search engines for Arabic language. That is, we propose to use a new extension of the LSI technique instead of just using the standard LSI technique. The LSI method proposed is based on employing the word co-occurrences to form a term-by-document matrix. The proposed method is to be based on the documents evaluating cosine similarity measures for term-by-document matrix. We will empirically evaluate the performance using an Arabic data collection that contains no less than 500 documents with no less than 30,000 unique words. A testing set contains keywords from a specific domain will be used to evaluate the quality of the top 20-30 retrieved documents using different singular values (i.e. different number of dimensions). The results will be judged on the performance of the proposed method as it is compared to the standard LSI.
AB - The Vector Space Model (VSM) is a common document representation model that is widely used in data mining and information retrieval (IR) systems. However, this technique poses some challenges such as high dimensional space and semantic loss representation. Therefore, the latent semantic indexing (LSI) is proposed to reduce the feature dimensions and to generate semantic rich features that represent conceptual term-document associations. In particular, LSI has been successfully implemented in search engines and text classification tasks. In this paper, we propose a novel approach to enhance the quality of the retrieved documents in search engines for Arabic language. That is, we propose to use a new extension of the LSI technique instead of just using the standard LSI technique. The LSI method proposed is based on employing the word co-occurrences to form a term-by-document matrix. The proposed method is to be based on the documents evaluating cosine similarity measures for term-by-document matrix. We will empirically evaluate the performance using an Arabic data collection that contains no less than 500 documents with no less than 30,000 unique words. A testing set contains keywords from a specific domain will be used to evaluate the quality of the top 20-30 retrieved documents using different singular values (i.e. different number of dimensions). The results will be judged on the performance of the proposed method as it is compared to the standard LSI.
KW - Arabic text
KW - Dimensionality reduction
KW - Latent semantic indexing
KW - Searche engine
UR - https://www.scopus.com/pages/publications/85061801552
U2 - 10.1109/ICONIC.2018.8601096
DO - 10.1109/ICONIC.2018.8601096
M3 - Conference contribution
AN - SCOPUS:85061801552
T3 - 2018 International Conference on Intelligent and Innovative Computing Applications, ICONIC 2018
BT - 2018 International Conference on Intelligent and Innovative Computing Applications, ICONIC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Conference on Intelligent and Innovative Computing Applications, ICONIC 2018
Y2 - 6 December 2018 through 7 December 2018
ER -