Latent Semantic Analysis Based Automatic Cross-Language Plagiarism Detector for Paragraph Written in Two Syntactically Distinct Languages

Abstract

The number of scientific publications in Bahasa Indonesia is rising steadily. As speakers of an under-resourced language, Indonesian authors often consult documentation in other languages, especially English, so the need for an automated cross-language plagiarism checker has become prominent. Several methods for automated cross-language plagiarism detection are available, but most of them work well only on syntactically similar languages. Bahasa Indonesia and English come from very different language families and therefore have completely different syntax. This paper investigates the possibility of extending Latent Semantic Analysis (LSA) to automated cross-language plagiarism checking between two syntactically distinct languages. LSA's bag-of-words concept is exploited, removing the need for a grammatically correct automatic translator. Several modifications to the LSA algorithm are also proposed to improve its performance. The proposed proof-of-concept algorithm is capable of finding similarities between a paragraph and its exact translation written in a different language. The exact translation of a paragraph is identified with 81.82% to 90.91% accuracy across all test cases.
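To illustrate why a bag-of-words representation removes the need for a grammatically correct translator, the following minimal sketch translates an Indonesian sentence word by word and compares it with English candidates by cosine similarity of term counts. It is not the authors' algorithm: the glossary, the function names (bag_of_words, cosine, best_match), and the example sentences are hypothetical, and the singular value decomposition step that turns plain term counts into LSA, as well as the modifications proposed in the paper, are omitted.

```python
import math
from collections import Counter

# Hypothetical toy Indonesian-to-English glossary (an assumption for this
# sketch); a real system would need a much larger dictionary.
GLOSSARY = {
    "saya": "i",
    "membaca": "read",
    "buku": "book",
    "merah": "red",
    "kucing": "cat",
    "makan": "eat",
    "ikan": "fish",
}


def bag_of_words(text, glossary=None):
    """Lowercase, split on whitespace, and count terms.

    If a glossary is given, each word is mapped through it one by one;
    unknown words are kept unchanged. Word order is discarded.
    """
    words = text.lower().split()
    if glossary is not None:
        words = [glossary.get(w, w) for w in words]
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(source_id, candidates_en):
    """Return the index of the most similar English candidate and all scores."""
    query = bag_of_words(source_id, GLOSSARY)
    scores = [cosine(query, bag_of_words(c)) for c in candidates_en]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # "buku merah" is literally "book red": the adjective follows the noun,
    # so a word-by-word translation is ungrammatical in English.
    best, scores = best_match(
        "saya membaca buku merah",
        ["cat eat fish", "i read a red book"],
    )
    print(best, scores)
```

Because the term counts ignore word order, the ungrammatical word-by-word rendering "i read book red" still scores far higher against "i read a red book" than against the unrelated candidate. A full LSA pipeline would additionally build a term-by-paragraph matrix and reduce its rank with a singular value decomposition before comparing vectors.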



Author Information
Anak Agung Putri Ratna, Universitas Indonesia, Indonesia
Emily Lomempow, Universitas Indonesia, Indonesia
Prima Dewi Purnamasari, Universitas Indonesia, Indonesia
Untung Yuwono, Universitas Indonesia, Indonesia
Boma Anantasatya Adhi, Universitas Indonesia, Indonesia

Paper Information
Conference: ACSET2015
Stream: Education and Technology: Teaching

This paper is part of the ACSET2015 Conference Proceedings.