TY - GEN
T1 - A micro-word based approach for Arabic sentiment analysis
AU - Al-Anzi, Fawaz S.
AU - Abuzeina, Dia
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/7/2
Y1 - 2017/7/2
AB - Sentiment analysis of social network data has recently received a great deal of attention. Social networks are characterized by uncommon language that differs from the standard form of the language. Hence, there is a demand for effective methods to analyze the large volume of new word variants that appear daily in the digital and online world. In text classification, the vector space model (VSM) is built on a vocabulary list (i.e. the words of the entire training set) and ignores odd words that fall outside it, which leads to a partial loss of textual information. To address this challenge, we propose using each pair of neighboring letters in a word as the basic feature unit instead of the word itself. That is, instead of using words in the VSM, we propose a new method that decomposes each word into a sequence of micro-words, each consisting of only two consecutive letters. Two data collections were employed to investigate the performance. The data collections include common (i.e. standard form) and uncommon Arabic text (obtained from Instagram). For the common text, we used a corpus that contains 1,500 documents for training and 500 documents for testing. The proposed method was evaluated using latent semantic indexing (LSI) for textual features and the cosine similarity measure for classification. The experimental results are promising: the proposed method correctly classifies the testing set documents with an accuracy of up to 83.6%.
KW - Arabic text
KW - Classification
KW - Cosine similarity measure
KW - Latent semantic indexing
KW - Sentiment analysis
KW - Social networks
UR - http://www.scopus.com/inward/record.url?scp=85046076846&partnerID=8YFLogxK
U2 - 10.1109/AICCSA.2017.177
DO - 10.1109/AICCSA.2017.177
M3 - Conference contribution
AN - SCOPUS:85046076846
T3 - Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA
SP - 910
EP - 914
BT - Proceedings - 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications, AICCSA 2017
PB - IEEE Computer Society
T2 - 14th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2017
Y2 - 30 October 2017 through 3 November 2017
ER -