INTRODUCTION
Examinations have always been the traditional way to judge a student's academic expertise. We have all undergone strict invigilation during our exams. However, cheating has always been part and parcel of examinations. No matter how strict the system is, students somehow manage to copy. It has often been observed that even undeserving students obtain good ranks through cheating. It is essential to curb plagiarism so that only deserving students succeed.
Source code plagiarism is very common among students in schools, colleges and universities that teach programming languages. The reasons for copying are many: for example, lack of study, poor performance, pressure from parents, or a subject that was not understood. With the advent of the internet and digitization, material on any domain is just a click away, and so is the opportunity to plagiarize. Before getting into the details of source code plagiarism, we give a formal definition of the word plagiarism, which is used throughout this paper, and then describe the types of source code plagiarism. According to the Merriam-Webster Online Dictionary, to "plagiarize" means
- to steal and pass off (the ideas or words of another) as one's own
- to use (another's production) without crediting the source
- to commit literary theft.
Types of Source Code Plagiarism:
Textual Similarity: Two codes are said to be textually similar if the words and variables in their source code are similar.
- Type 1: One code is a copy of another except for changes in indentation and line spacing.
- Type 2: Same as Type 1, with additional modifications to variable names and function names.
- Type 3: The copied code has some lines added or removed, where the changes do not affect its meaning.
Functional Similarity: Two codes are said to be functionally similar if they have the same semantics or perform the same action.
Because the number of teachers is limited, detecting plagiarism manually becomes more difficult and time consuming as the number of students increases. Manual evaluation often tends to be shallow, as the teacher usually just scans the code and does not check every aspect. With a growing number of students, comparing each source code file with all the others is a very cumbersome task. If, say, around 200 students take a programming test, there are 200 × 199 / 2 = 19,900 pairs of submissions, and it is practically impossible for a teacher to work out manually which student copied which part of a program from whom. Even after strict manual checking, traces of plagiarism may go undetected. To relieve teachers of this burden, we have developed “Parikshak”, a tool for automatically detecting source code plagiarism.
There are mainly two approaches used in plagiarism detection tools:
1. Attribute Based
2. Structure Based
In the attribute-based approach, the focus is on certain properties of the program, such as the number of lines, the number of words, and the number of characters. Two files with the same values for these properties are considered candidates for plagiarism.
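As an illustration only, the attribute-based idea can be sketched in a few lines of Python. The function names `attribute_profile` and `is_candidate` are hypothetical, and the choice of exactly three attributes follows the description above rather than any particular tool.

```python
def attribute_profile(source: str) -> tuple[int, int, int]:
    """Return (line count, word count, character count) for one source file."""
    lines = source.splitlines()
    words = source.split()
    return (len(lines), len(words), len(source))


def is_candidate(a: str, b: str) -> bool:
    """Flag two files as a candidate pair when their profiles are identical."""
    return attribute_profile(a) == attribute_profile(b)


# A copy with renamed variables keeps the same profile...
original = "int total = 0;\nfor (int i = 0; i < n; i++) total += i;\n"
renamed = "int count = 0;\nfor (int j = 0; j < n; j++) count += j;\n"
# ...but a copy with one extra line break does not.
reformatted = "int total = 0;\nfor (int i = 0; i < n; i++)\n    total += i;\n"

print(is_candidate(original, renamed))      # same counts
print(is_candidate(original, reformatted))  # line count differs
```

The second comparison shows why attribute counts alone are fragile: a single change in layout alters the profile, which is one motivation for the structure-based approach.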
In the structure-based approach, the focus is on the structure of the program. The program is first tokenized, and the token sequences are then compared using the Greedy String Tiling algorithm.
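The tokenization step can be sketched as follows. This is a minimal illustration under our own assumptions, not the tokenizer of Parikshak: the keyword set is deliberately tiny, and the placeholder tokens `ID` and `NUM` are hypothetical names. Keywords and punctuation are kept, while identifiers, literals, comments and whitespace are normalized away.

```python
import re

# Assumed, deliberately small keyword set; a real Java tokenizer would list them all.
JAVA_KEYWORDS = {"int", "for", "while", "if", "else", "return"}

TOKEN_RE = re.compile(r"""
    //[^\n]*            # line comment, discarded later
  | /\*.*?\*/           # block comment, discarded later
  | [A-Za-z_]\w*        # identifier or keyword
  | \d+                 # integer literal
  | [^\s\w]             # any single punctuation or operator character
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Turn source text into a list of structure tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        text = match.group()
        if text.startswith("//") or text.startswith("/*"):
            continue                      # comments carry no structure
        if text in JAVA_KEYWORDS:
            tokens.append(text)           # keywords are kept as they are
        elif text[0].isalpha() or text[0] == "_":
            tokens.append("ID")           # every identifier becomes ID
        elif text[0].isdigit():
            tokens.append("NUM")          # every number becomes NUM
        else:
            tokens.append(text)           # punctuation is kept
    return tokens


a = "int total = 0; // sum\nfor (int i = 0; i < n; i++) total += i;"
b = "int  s=0;\nfor(int k=0;k<n;k++)\n    s+=k;"
print(tokenize(a) == tokenize(b))
```

Because layout, comments and names are removed, copies that differ only in indentation or in variable names produce identical token sequences.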
From here on, this paper focuses on detecting source code plagiarism, as this is the requirement of the project “Parikshak”. The design and implementation of our tool are discussed briefly below. A performance analysis of the tool has also been carried out, which gives a better idea of its efficiency and of the languages for which it is best suited.
PROBLEM STATEMENT
Source code plagiarism is very common among students in schools, colleges and universities that teach programming languages. The reasons for copying are many: lack of study, poor performance, pressure from parents, or a subject that was not understood. With the advent of the internet and digitization, material on any domain is just a click away, and so is the opportunity to plagiarize. Examinations have always been the traditional way to judge a student's academic expertise, and cheating has always accompanied them: however strict the invigilation, some students manage to copy, and undeserving students are often observed to obtain good ranks by cheating. It is therefore essential to curb plagiarism so that only deserving students succeed. Plagiarism is understood here in the Merriam-Webster sense quoted in the Introduction: to steal and pass off the ideas or words of another as one's own, to use another's production without crediting the source, or to commit literary theft.
Existing systems and their Disadvantages
JPlag was developed by Guido Malpohl at the University of Karlsruhe. It converts a program's source code into token strings that represent the structure of the program, so it can be considered a structure-based tool. It applies the “Greedy String Tiling” algorithm proposed by Michael Wise, with additional optimizations for better efficiency. JPlag supports Java, C#, C, C++, Scheme and natural language text. It presents its results as a set of HTML pages, which are sent back to the client and stored locally.
Moss, an acronym for Measure of Software Similarity, was developed in 1994 at Stanford University by Aiken et al. It is provided as a web service that is accessed using a script obtained from the Moss website. A Moss account (and submission script) can be obtained by e-mail from moss@moss.stanford.edu. The submission script works on Unix/Linux platforms and may work under Windows with Cygwin, but the latter is untested. To measure similarity between documents, Moss uses a document fingerprinting algorithm called winnowing. Moss can currently analyse code written in the following languages: C, C++, Java, C#, Python, Visual Basic, JavaScript, FORTRAN, ML, Haskell, Lisp, Scheme, Pascal, Modula2, Ada, Perl, TCL, Matlab, VHDL, Verilog, Spice, MIPS assembly, a8086 assembly, HCL2.
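The idea behind winnowing can be sketched as below. This is an illustration of the published technique, not Moss's actual code: the k-gram size `k`, the window size `w`, and the use of CRC32 as the hash function are our own assumptions. Every k-character substring is hashed, and from each window of `w` consecutive hashes the smallest (rightmost on ties) is kept as a fingerprint.

```python
import zlib


def winnow(text, k=5, w=4):
    """Return the set of (hash, position) fingerprints selected by winnowing."""
    # Hash every k-gram; CRC32 is an assumed stand-in for a rolling hash.
    hashes = [zlib.crc32(text[i:i + k].encode()) for i in range(len(text) - k + 1)]
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Pick the rightmost occurrence of the minimum within the window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def shared_fingerprints(a, b, k=5, w=4):
    """Hashes selected as fingerprints in both documents."""
    hashes_a = {h for h, _ in winnow(a, k, w)}
    hashes_b = {h for h, _ in winnow(b, k, w)}
    return hashes_a & hashes_b


common = "for(i=0;i<n;i++)"
doc_a = "abc " + common + " def"
doc_b = "zzzzzz" + common + "qq"
print(len(shared_fingerprints(doc_a, doc_b)) > 0)
```

Any substring of at least w + k − 1 characters that occurs in both documents contributes at least one common fingerprint, so the 16-character loop header shared by the two strings above is always detected. Texts too short to fill one window yield no fingerprints in this sketch.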
Plaggie is a source code plagiarism detection engine meant for Java programming exercises. In appearance and functionality it is similar to JPlag, but some aspects of Plaggie make it very different from JPlag: Plaggie must be installed locally, and its source code is open. Plaggie was developed in 2002 by Ahtiainen et al. at Helsinki University of Technology. It is a stand-alone command line Java application. The basic algorithm used for comparing two source code files is the same as in JPlag: tokenization followed by Greedy String Tiling.
Plaggie supports only Java (version 1.5 and above), and there is no scope for adding other languages.
Marble is a plagiarism detection tool for Java programs. It is used at the Department of Information and Computer Sciences at Utrecht University to assist lecturers in detecting plagiarism in programming assignments. Marble uses a structure-based approach to compare the submissions. Marble supports Java; support for Perl, PHP and XSLT is experimental. The results are written to a script named either suspects.nf (unsorted) or suspects.nfs (sorted). When run, this script prints, for each pair that exceeds a given similarity threshold, the similarity score and the size of both files, and then opens both original files in a diff editor to show the differences.
Copy/Paste Detector (CPD) is an add-on to PMD that uses the Rabin–Karp string search algorithm to find duplicate code. CPD works with Java, JSP, C, C++, FORTRAN, PHP, and C# code. It is bundled with PMD and can also be run via Java Web Start. CPD follows these steps:
- Tokenization of the code.
- Construction of an occurrence table from the tokens.
- Search for duplicate code in the occurrence table.
Results are presented in HTML format. CPD does not exclude template code or small files, so the chance of a reported resemblance increases even where some code is allowed to be the same for all students.
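The occurrence-table idea can be sketched as follows. This is our own simplified illustration, not CPD's implementation: the names `occurrence_table` and `duplicates` and the parameter `run_length` are hypothetical, and a Python dictionary keyed by token tuples stands in for the Rabin–Karp rolling hash.

```python
from collections import defaultdict


def occurrence_table(tokens, run_length):
    """Map every run of `run_length` consecutive tokens to its start positions."""
    table = defaultdict(list)
    for i in range(len(tokens) - run_length + 1):
        table[tuple(tokens[i:i + run_length])].append(i)
    return table


def duplicates(tokens, run_length=4):
    """Keep only the runs that occur at more than one position."""
    table = occurrence_table(tokens, run_length)
    return {run: positions for run, positions in table.items() if len(positions) > 1}


# Pre-tokenized input: the statement "a = b + c ;" appears twice.
tokens = "a = b + c ; x = 1 ; a = b + c ;".split()
for run, positions in duplicates(tokens).items():
    print(" ".join(run), positions)
```

Runs that appear at two or more positions are the duplicate candidates; a full tool would then merge adjacent runs into longer duplicated blocks.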
CodeMatch was developed by Bob Zeidman, a senior member of the IEEE and president of Zeidman Consulting, a contract research and development firm. CodeMatch makes use of knowledge of programming languages and program structures to improve the matching results. It uses a combination of five algorithms to find plagiarism: Source Line Matching, Comment Line Matching, Word Matching, Partial Word Matching, and Semantic Sequence Matching. Each algorithm is useful in finding clues to plagiarism that the other algorithms may miss. By using all five algorithms, the chance of missing plagiarized code is significantly diminished. Before any of the algorithms is run, some pre-processing is done to create string arrays. CodeMatch supports BASIC, C, C++, C#, Delphi, Flash ActionScript, Java, JavaScript, MASM, Pascal, Perl, PHP, PowerBuilder, Ruby, SQL, Verilog and VHDL.
Proposed System and its Advantages
We have used a structure-based approach, like many of the plagiarism detection tools discussed above, in which the source code in a set of files or directories is first tokenized. The process of tokenization is explained briefly in the design and implementation part of this paper.
It is essential for any plagiarism detection tool to be extendible, because learning a single language in a course is not sufficient given the competition and changing technologies. A plagiarism detector should therefore allow a new language to be added as and when required. We have incorporated this feature in our tool: a new language can be added without significant changes to the code, simply by introducing the set of keywords of that language. At present, our tool supports the Java language.
Reachability is another important aspect of any tool: if a tool is not reachable, its usability suffers. Our tool, “Parikshak”, is deployed on the web, so reachability is not an issue. Teachers can log in at any time from any terminal with internet access. Terminals with a minimum configuration can also display the results, as the output is just an HTML page. No extra software needs to be installed.
The interface is also very user friendly: the teacher just selects, using check boxes, the files that he or she suspects are plagiarised and can see the results within a minute or so. The output is a list of files with the percentage of matching code in each. There is also a button to view the matching code. When the teacher clicks the button, the two files are shown side by side with the matching code highlighted in different colours. A click on matching code on one side brings the similar code on the other side into focus. A snapshot of the output is shown in figure 3 below. This makes it easy to compare the results.
The teacher does not have to wait long for the results: our tool offers a short response time, and the results for a batch of source code files can be seen within one minute. This saves the teacher time and frustration and helps to identify the students at fault.
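The paper does not state how the percentage of matching code is computed. One common choice, used by JPlag, is twice the number of tokens covered by matches divided by the combined token count of the two files; the sketch below assumes that measure. The function name `similarity_percentage` and the tile format are hypothetical.

```python
def similarity_percentage(tiles, len_a, len_b):
    """Percentage of matching code between two token sequences.

    tiles  -- non-overlapping matches as (start_a, start_b, length) triples
    len_a  -- number of tokens in the first file
    len_b  -- number of tokens in the second file
    """
    if len_a + len_b == 0:
        return 0.0
    covered = sum(length for _, _, length in tiles)
    # Assumed measure: 2 * covered tokens / (|A| + |B|), as in JPlag.
    return 100.0 * 2 * covered / (len_a + len_b)


# Two matches of 12 and 8 tokens between files of 40 and 60 tokens.
print(similarity_percentage([(0, 3, 12), (20, 25, 8)], 40, 60))
```

With this measure two identical files score 100 and two files with no match score 0, which makes the percentages in the result list directly comparable across pairs.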
Advantages of Proposed system
- Structure-based approach
- Extendibility
- Reachability (deployed on the web)
- No need to install any extra software
- Easy-to-use interface
SYSTEM REQUIREMENTS SPECIFICATIONS
A system requirements specification (SRS) is a structured collection of information that embodies the requirements of a system. The SRS document lists all the requirements necessary for the development of the project. To derive the requirements, we need a clear and thorough understanding of the product to be developed. The SRS may be a contract deliverable Data Item Description or have another form of organizationally mandated content.
Hardware requirements
- Processor : Dual core processor or above
- Speed : 1.6 GHz or above
- Hard disk : 40 GB or above
- RAM : 2 GB or above
Software Requirements
- Operating system : Windows XP/7
- Front end : HTML5, CSS3, Skeleton, Foundation, jQuery, Ajax
- Back end : Java, JEE (Servlet, JDBC, JSP)
- Web server : Apache Tomcat
- Database : MySQL/Oracle
- Build tools : Maven
- Other tools : Eclipse IDE, 7-Zip, WinSCP
SYSTEM ANALYSIS AND DESIGN
System Architecture
The architecture of our tool consists of a web interface, where the files to be checked for plagiarism are selected and the results are displayed, followed by tokenization and comparison. Tokenization is a language-dependent phase, whereas comparison is language independent. The tokenizer itself handles the parsing of source code in the different languages. Fig. shows the architecture of Parikshak's plagiarism detector. It has three modules. The web interface is what the user sees when requesting plagiarism detection. The request is forwarded to the tokenizer, which tokenizes the source codes. The output of the tokenizer is then fed to the comparator, where the source codes are analyzed for similarities. Both tokenization and comparison are explained briefly in subsequent sections.
Modules
Module 1: Repository management portal development – Account Access Operations
Module 2: Repository management portal development – Repository operations
Module 3: Developing user interface for Source code plagiarism detector
Module 4: Developing the source file comparator component for source code plagiarism detector
Module 5: Developing the similarity percentage calculator and Result builder for source code plagiarism detector
Module 6: Developing user interface for Text files plagiarism detector
Module 7: Developing the text files comparator component for text files plagiarism detector
Module 8: Developing the similarity percentage calculator and Result builder for text files plagiarism detector
Dataflow Diagram
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system, modelling its process aspects. A DFD is often used as a preliminary step to create an overview of the system, which can later be elaborated.
Sequence Diagram
A Sequence diagram is an interaction diagram that shows how processes operate with one another and in what order. It is a construct of a Message Sequence Chart. A sequence diagram shows object interactions arranged in time sequence.
Class Diagram
Fig. shows the class diagram of Parikshak's plagiarism detector. Language is the class responsible for tokenizing source codes, and Compare is responsible for finding the similarities between source codes. C, PHP, Java, Perl, Python and C++ are the classes for the languages in which Parikshak supports plagiarism detection.
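The relationship between a language description and the tokenizer can be sketched as below. Only the class name `Language` comes from the class diagram; its constructor, the `REGISTRY` dictionary and the `register` helper are hypothetical, as are the shortened keyword sets. The sketch is in Python for brevity, whereas the tool itself is built on Java.

```python
import re


class Language:
    """Hypothetical language description: a name plus its keyword set."""

    def __init__(self, name, keywords):
        self.name = name
        self.keywords = frozenset(keywords)

    def tokenize(self, source):
        """Keep keywords and punctuation; normalize identifiers and numbers."""
        tokens = []
        for word in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source):
            if word in self.keywords:
                tokens.append(word)
            elif word[0].isalpha() or word[0] == "_":
                tokens.append("ID")
            elif word[0].isdigit():
                tokens.append("NUM")
            else:
                tokens.append(word)
        return tokens


REGISTRY = {}


def register(language):
    """Make a language available to the detector under its name."""
    REGISTRY[language.name] = language


# Adding a language only requires supplying its keywords (shortened here).
register(Language("java", {"int", "for", "return", "class"}))
register(Language("python", {"def", "for", "in", "return"}))

print(REGISTRY["python"].tokenize("def f(x): return x"))
```

Because the comparison stage works on token lists only, it does not need to change when a new entry is registered.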
Algorithm
The tool Parikshak is based on the Greedy String Tiling (GST) algorithm, which involves tiling one string with matching substrings of a second string. This makes matching possible in an efficient and simple manner.
search-length s := initial-search-length
stop := false
repeat
    /* Lmax is the size of the largest maximal match found in this iteration */
    Lmax := scanpattern(s)
    if Lmax > 2 × s then
        /* Very long string; don't mark tiles but try again with a larger s */
        s := Lmax
    else
        /* Create tiles from matches taken from the list of queues */
        markstrings(s)
        if s > 2 × minimum_match_length then s := s div 2
        else if s > minimum_match_length then s := minimum_match_length
        else stop := true
until stop
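The pseudocode above is the top-level loop of the hashed variant of the algorithm, whose helpers scanpattern and markstrings are not given here. As a complement, the following Python sketch shows the basic, unoptimized form of Greedy String Tiling, in which the longest remaining common runs are found by direct comparison. It is an illustration under our own assumptions, not the implementation used in Parikshak; the function name and the tile format are hypothetical.

```python
def greedy_string_tiling(a, b, minimum_match_length=3):
    """Return non-overlapping tiles as (start_a, start_b, length) triples."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_match = minimum_match_length
        matches = []
        # Phase 1: collect the longest matches made of unmarked tokens only.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        # Phase 2: turn matches into tiles, skipping any that now overlap.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = True
                marked_b[j + offset] = True
            tiles.append((i, j, k))
        if not matches:
            break   # no match of at least minimum_match_length remains
    return tiles


# "abc" and "def" appear in both sequences but in a different order.
print(greedy_string_tiling(list("abcdefgh"), list("xxdefyyabc")))
```

Since tiles are matched independently of their position, reordering blocks of code does not hide them, and the minimum match length keeps short coincidental matches from being reported.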