Duplicate Files in Local File System (File Based Index) - Custom Language plugin

Answered

Hi, my plugin needs to check for duplicate files in my local file system. As I'm new to plugin development, these are the steps I followed to implement it.

1. Access the file system using

VirtualFile rootDir = LocalFileSystem.getInstance().findFileByIoFile(new File(basePath));

2. `basePath` contains multiple subdirectories with .csv files in them, plus a .git folder (the folder is version controlled).

3. Recursively traverse the structure to fetch the files:

VfsUtilCore.iterateChildrenRecursively(rootDir, null, new ContentIterator() {
    @Override
    public boolean processFile(@NotNull VirtualFile fileOrDir) {
        if (!(fileOrDir.isDirectory() && fileOrDir.getName().equals(".git"))) {
            if (!fileOrDir.isDirectory() && fileOrDir.getName().endsWith(".csv")) {
                try {
                    System.out.println(fileOrDir.getName());
                    checkingOnDuplicates.putValue(new String(fileOrDir.contentsToByteArray()),
                            fileOrDir);
                } catch (IOException ioException) {
                    ioException.printStackTrace();
                }
            }
        }
        return true;
    }
});

4. checkingOnDuplicates is a MultiMap keyed by the file contents (as a String), with the matching files as values, which is how duplicates are detected.

5. As a learning exercise, I trigger the duplicate search from an editor popup menu action.

The complete code is in https://github.com/venrad/intellij-plugin-tutorials

The above solution had two major problems. 

1. The program hung for a long time before failing with an out-of-memory error.

2. Because the files are searched recursively, I think this is inefficient.

I came across FileBasedIndex, which may be a better fit for my requirement. Unfortunately, I couldn't find a quick example that would help me understand how to implement one. Could someone help?

I've read http://confluence.jetbrains.net/display/IDEADEV/Indexing+and+PSI+Stubs+in+IntelliJ+IDEA and the documentation is good, but I'm struggling with the implementation.

1 comment

I'm not sure what "duplicates" means here exactly. If you mean files with the same name, you can use the existing FilenameIndex to quickly locate all files with a specific name: https://jetbrains.org/intellij/sdk/docs/basics/indexing_and_psi_stubs/file_based_indexes.html#file-name-index

If you want to detect duplicates by file contents, you can provide a custom com.intellij.util.indexing.FileBasedIndexExtension that indexes all .csv files and stores a calculated hash of each file's contents.
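To illustrate the hashing idea on its own, here is a minimal sketch in plain Java. It uses only the JDK, not the IntelliJ VFS or the index API, so it is not a FileBasedIndexExtension implementation. The class and method names (CsvDuplicateFinder, sha256, findDuplicates) are hypothetical, and SHA-256 is just one reasonable choice of hash. It streams each file through the digest instead of keeping the whole content as a map key, and it prunes the .git subtree rather than merely ignoring the directory entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the IntelliJ Platform API.
final class CsvDuplicateFinder {

    /** Streams a file through SHA-256 so its content is never fully held in memory. */
    static String sha256(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading through the DigestInputStream updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns content hash -> files, keeping only hashes shared by two or more .csv files. */
    static Map<String, List<Path>> findDuplicates(Path root) throws IOException {
        Map<String, List<Path>> byHash = new HashMap<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                Path name = dir.getFileName();
                boolean isGit = name != null && name.toString().equals(".git");
                // SKIP_SUBTREE prevents descending into .git at all.
                return isGit ? FileVisitResult.SKIP_SUBTREE : FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                if (file.getFileName().toString().endsWith(".csv")) {
                    byHash.computeIfAbsent(sha256(file), k -> new ArrayList<>()).add(file);
                }
                return FileVisitResult.CONTINUE;
            }
        });
        // Drop hashes that belong to a single file; what remains are duplicate groups.
        byHash.values().removeIf(paths -> paths.size() < 2);
        return byHash;
    }
}
```

Calling CsvDuplicateFinder.findDuplicates(Paths.get(basePath)) would return one entry per group of identical files. In a custom index, the same hash would presumably serve as the index key, so files with equal contents could be looked up without rescanning the directory tree.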
