Muscat: Pioneering techniques designed to detect whether Arabic text is being reused online – and help to identify plagiarism – have been developed by a graduate student in the UAE.
A thesis produced by Leena Mahmoud Ahmed Lulu, who is studying for a Doctor of Philosophy (Ph.D.) degree and conducted her research within the United Arab Emirates University College of Information Technology, has outlined how a new method based on document fingerprinting can discover whether original Arabic content on the internet is being reused by others.
She undertook the research after noting that little or no work had been carried out on discovering instances of text reuse and plagiarism in the Arabic language.
Her research paper, which has now been published, also proposes a new web search tool to accompany the detection method, allowing lengthier queries to be entered when trying to assess if content is entirely original or has been used before.
“While the local text reuse [meaning only a small part of a document is copied and modified] detection problem has been mostly studied for Western languages, it is still one of the biggest challenges in the Arabic language and the research has remained quite limited. The results of this research can be thought of as rich tools for information analysts, to validate and assess information coming from uncertain sources,” said Lulu in her thesis.
“It is also time for Web users to become ‘fact inspectors’, by providing them with a tool that allows people to quickly check the validity and originality of statements and sources.”
Lulu’s research paper explained that the most widely used and effective approach to finding reused text is to detect documents which share one or more “fingerprints”, a reliable indicator that they contain some of the same text.
However, it also pointed out that the linkage between Arabic letters, the right-to-left writing direction of Arabic text, and the flexibility of its word order reduce the efficiency of such techniques – a problem that the new fingerprinting model developed through her research, tailored for the Arabic language, aims to solve.
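For readers curious how fingerprint matching works in general, the idea can be sketched in a few lines of Python. This is a generic illustration that hashes overlapping runs of words, not the Arabic-specific model described in the thesis; the function names, the window size `k=3` and the sample sentences are all assumptions made for this example.

```python
import hashlib

def fingerprints(text, k=3):
    """Return a set of short hashes, one per run of k consecutive words.

    Generic word k-gram fingerprinting; an assumption for illustration,
    not the thesis's Arabic-tailored model.
    """
    words = text.split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {hashlib.md5(g.encode("utf-8")).hexdigest()[:12] for g in grams}

def shares_reused_text(doc_a, doc_b, k=3):
    """True if the two documents have at least one fingerprint in common."""
    return bool(fingerprints(doc_a, k) & fingerprints(doc_b, k))

# Hypothetical sample sentences: the second reuses part of the first.
original = "القراءة تفتح أبواب المعرفة أمام كل طالب"
suspect = "يرى الكاتب أن القراءة تفتح أبواب المعرفة دائما"
unrelated = "الطقس اليوم مشمس وجميل في المدينة"

print(shares_reused_text(original, suspect))
print(shares_reused_text(original, unrelated))
```

Because each fingerprint here depends on the exact order of the words, a sentence whose words have simply been rearranged would slip past this basic version, which is one of the weaknesses the research paper attributes to such techniques when they are applied to Arabic.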
“Our proposed method proved to be more robust for detecting text reuse, particularly when the sentence length increases toward the average sentence length in the Arabic language,” Lulu said.
“The system first creates an initial documents collection obtained from the Web, then applies the detection techniques for finding text reuse with a given input document from this collection.”
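The two-stage process Lulu describes, building a collection and then checking an input document against it, might look roughly like the Python sketch below. It is an assumption-laden illustration only: the small dictionary stands in for pages gathered from the Web, plain word triples stand in for real fingerprints, and the names `build_index` and `find_reuse` are hypothetical rather than taken from the thesis.

```python
def shingles(text, k=3):
    """Return the set of runs of k consecutive words (simple stand-in fingerprints)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def build_index(collection, k=3):
    """Stage 1: map each document id in the collection to its fingerprints."""
    return {doc_id: shingles(text, k) for doc_id, text in collection.items()}

def find_reuse(query_text, index, k=3):
    """Stage 2: list (doc_id, shared_count) pairs, most overlap first."""
    query = shingles(query_text, k)
    hits = []
    for doc_id, prints in index.items():
        shared = len(query & prints)
        if shared:
            hits.append((doc_id, shared))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical stand-in for a collection of pages obtained from the Web.
collection = {
    "page_a": "القراءة تفتح أبواب المعرفة أمام كل طالب",
    "page_b": "الطقس اليوم مشمس وجميل في المدينة",
    "page_c": "تفتح أبواب المعرفة أمام الجميع",
}
query = "يرى الكاتب أن القراءة تفتح أبواب المعرفة أمام القراء"

for doc_id, shared in find_reuse(query, build_index(collection)):
    print(doc_id, shared)
```

In this sketch, pages that share no word triples with the input document are dropped, and the rest are ranked by how many triples they have in common with it.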
Lulu said possible future research could focus on areas including the development of new approaches that would allow the document fingerprints to be more targeted and specific, enhancing the effectiveness of the method.
She also suggested that a thesaurus of “paraphrased” Arabic sayings, which might otherwise go undetected in a text reuse search, could be compiled.