[ANN] Duplicate code detection plug-in (SimScan beta 2)

SimScan is a tool for detecting duplicate or similar fragments in Java source code. It operates on the parsed source tree and disregards differences such as renamed variables, code formatting, and comments. It also finds non-exact matches, where a portion of code was copy-pasted and then modified.


Download from:
http://blue-edge.bg/simscan

7 comments

Hi,

There is another similar plugin:
http://www.intellij.org/twiki/bin/view/Main/SamePlugin

What are the differences between Same and SimScan?

Marius


SimScan finds more of the relevant matches, even when there are small differences between them. Same finds only exact matches (I think it ignores only whitespace).

With an appropriate size threshold, SimScan will actually find that this code:

    public int exec(Prog p) {
        Component C=(Component)(((JavaObject)getArg(0)).toObject());
        C.removeNotify();
        return 1;
    }

is similar to this code:

    public int exec(Prog p) {
        Container C=(Container)(((JavaObject)getArg(0)).toObject());
        C.removeAll();
        return 1;
    }

and to this code:

    public int exec(Prog p) {
        JinniText T=(JinniText)(((JavaObject)getArg(0)).toObject());
        T.setText("");
        return 1;
    }

Same will find none of these. On the other hand, Same is much faster than SimScan. It's a speed/quality trade-off.
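To make the idea concrete, here is a rough sketch of why fragments like these can score as near-duplicates once names are ignored. It is not SimScan's algorithm (SimScan works on the parsed tree, and its internals are not described in this thread); it swaps in a much simpler token-based comparison. The class name `SimilaritySketch`, the placeholders `ID`/`NUM`/`STR`, and the LCS-based score are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: a token-level stand-in for tree-based matching.
final class SimilaritySketch {
    // Identifier/keyword, integer literal, string literal, or any other single symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\"[^\"]*\"|\\S");

    // Only the keywords used in the fragments above; a real tool would know them all.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("public", "int", "return"));

    /** Tokenizes source text, replacing names and literals with placeholders. */
    public static List<String> normalize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if (Character.isLetter(first) || first == '_') {
                tokens.add(KEYWORDS.contains(t) ? t : "ID");   // renamed variables vanish
            } else if (Character.isDigit(first)) {
                tokens.add("NUM");
            } else if (first == '"' && t.length() > 1) {
                tokens.add("STR");
            } else {
                tokens.add(t);                                  // punctuation kept as-is
            }
        }
        return tokens;
    }

    /** Length of the longest common subsequence of two token lists. */
    static int lcs(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.size()][b.size()];
    }

    /** Similarity in [0, 1]; 1.0 means identical after normalization. */
    public static double similarity(String left, String right) {
        List<String> a = normalize(left);
        List<String> b = normalize(right);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        return 2.0 * lcs(a, b) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        String first = "public int exec(Prog p) { Component C=(Component)"
                + "(((JavaObject)getArg(0)).toObject()); C.removeNotify(); return 1; }";
        String second = "public int exec(Prog p) { Container C=(Container)"
                + "(((JavaObject)getArg(0)).toObject()); C.removeAll(); return 1; }";
        String third = "public int exec(Prog p) { JinniText T=(JinniText)"
                + "(((JavaObject)getArg(0)).toObject()); T.setText(\"\"); return 1; }";
        System.out.println("first vs second: " + similarity(first, second));
        System.out.println("first vs third:  " + similarity(first, third));
    }
}
```

After normalization the first two fragments produce the same token sequence, so they score 1.0; the third differs only by the extra string-literal token and scores slightly lower. An exact-text comparison sees three different methods, which is the gap between exact and similarity-based matching described above.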

SimScan also has better integration with IDEA: it selects the similar code directly in the IDEA editor, allowing immediate refactoring. The Same plug-in opens a separate window with a portion of the code, without the usual syntax highlighting, etc. SimScan has similar plug-ins for Eclipse and JBuilder as well.


It hung at 35% done on a relatively small package. Thread dump attached:
Full thread dump Java HotSpot(TM) Client VM (1.4.2-beta-b19 mixed mode):

at java.util.HashMap.<init>(HashMap.java:200)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST$2.compare(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.FuzzyAST.a(Unknown Source)
at bg.blue_edge.simscan.g.a(Unknown Source)
at bg.blue_edge.simscan.g.a(Unknown Source)
at bg.blue_edge.simscan.plugin.common.u$a.run(Unknown Source)

--
Best regards,
Maxim Shafirov
JetBrains, Inc / IntelliJ Software
http://www.intellij.com
"Develop with pleasure!"


SimScan is much nicer feature-wise, but it is very slow.

Another problem is the licensing. I cannot use it in a
commercial environment (and this is where IDEA is mostly
used I would guess) :(

Marius


We just had a follow-up release (beta 3) that mostly solves the speed problem by allowing searches an order of magnitude faster with very little degradation in quality. You can select different levels of trade-off. However, Same is still faster.

I believe the hang at 35% was in fact the search working through a difficult (and slower) subspace. Another possibility is a low-memory condition, but this is unlikely if the package is small. On packages of hundreds of KLOC you'd need a maximum memory limit above 512 megabytes for the JVM / IDE. Please let me know if the problem persists.
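If you want to rule out the memory explanation, a quick check of the limit a JVM is actually running with might look like the sketch below. It uses the standard `Runtime.maxMemory()` call; the class name `HeapLimitCheck` is made up for the example, and how the limit is raised for a particular IDE (typically through the JVM's `-Xmx` option) depends on the IDE's launcher and is not covered here.

```java
// Hypothetical helper: reports the maximum memory the current JVM will try to use.
final class HeapLimitCheck {
    /** Maximum memory available to this JVM, in megabytes. */
    public static long maxMemoryMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long mb = maxMemoryMegabytes();
        System.out.println("JVM maximum memory: " + mb + " MB");
        if (mb <= 512) {
            // 512 MB is the figure quoted above for packages of hundreds of KLOC.
            System.out.println("At or below 512 MB; large packages may need a higher -Xmx.");
        }
    }
}
```

Note that this reports the limit of the JVM it runs in, so it is only meaningful if it is started with the same settings as the IDE.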

The licence allows you to use SimScan in a commercial setting if you are going to release your source under an open source licence. For other uses, feel free to contact me.


What about the difference with CloneFinder?
What is the problem with licensing? Are you not going to release a commercial product in the end that one can purchase?

Jacques


Clone Finder works on the parsed tree too. However, it seems to miss A LOT of relevant matches. Clone Finder is much faster.

I suspect Clone Finder tries to find exact matches and then allows some small deviations. SimScan can be tuned for every kind of match as long as the matches are recognized as useful; we basically have a metric for the usefulness of the clones. Speed is good to have, but it should not be a big problem, as SimScan would not be run often.

We don't feel that SimScan is tuned well enough for commercial code yet, which is why we offer only the free license for now. A license will be available for commercial use, but development of programming tools is not the primary focus of the company, and this is unlikely to happen in the next two months.
