As I mentioned in my last post, I have been trying to access the Minipar library from Java. Our current approach, which uses the pdemo program to parse each sentence, has a performance problem: it took more than an hour to parse 28 research articles (about 265 ms per sentence). Most of that time is probably spent creating the pdemo process and loading the data files, both of which happen again before every sentence is parsed. So I wrote a Java proxy class that calls a C++ proxy class, which in turn calls the Minipar library. The initialization code runs only once, at the beginning, and no separate process needs to be created. The Java code calls the C++ library through the Java Native Interface (JNI). The illustration below shows the call chain for parsing a sentence.
Main Java program -> MiniparProxy.java -> MiniparProxy.cpp -> Minipar library
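To make the call chain a little more concrete, here is a minimal sketch of what the Java side of such a proxy might look like. It is not the actual MiniparProxy.java: the method names (`nativeInit`, `nativeParse`, `open`, `tryLoadLibrary`), the library name `miniparproxy`, and the `data` directory are all assumptions made for illustration. The point is only that the library is loaded and initialized once, and that each sentence afterwards costs a single native call.

```java
// Hypothetical sketch of the Java side of a JNI proxy; all names are illustrative.
final class MiniparProxy {
    // Implemented on the C++ side (for example in MiniparProxy.cpp). For a class
    // in the default package, JNI resolves these to the symbols
    // Java_MiniparProxy_nativeInit and Java_MiniparProxy_nativeParse.
    private native boolean nativeInit(String dataDir);
    private native String nativeParse(String sentence);

    private MiniparProxy() {}

    /** Tries to load a native library by name; returns false if it cannot be found. */
    static boolean tryLoadLibrary(String libName) {
        try {
            System.loadLibrary(libName);
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false;
        }
    }

    /** Loads the library and runs the one-time initialization. */
    static MiniparProxy open(String libName, String dataDir) {
        if (!tryLoadLibrary(libName)) {
            throw new IllegalStateException("Native library not found: " + libName);
        }
        MiniparProxy proxy = new MiniparProxy();
        if (!proxy.nativeInit(dataDir)) {
            throw new IllegalStateException("Parser initialization failed for " + dataDir);
        }
        return proxy;
    }

    /** Parses one sentence; no process is created and no data files are reloaded. */
    String parse(String sentence) {
        return nativeParse(sentence);
    }

    public static void main(String[] args) {
        try {
            // Assumed library name and data directory.
            MiniparProxy proxy = MiniparProxy.open("miniparproxy", "data");
            for (String s : new String[] {"The cat sat on the mat.", "Dogs bark."}) {
                System.out.println(proxy.parse(s));
            }
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Without the native library on `java.library.path`, the `main` method simply reports that the library was not found, so the sketch can be compiled and run on its own.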
The performance improvement is significant; see the table below for a comparison. The test set was 28 research articles (17,558 sentences, 341,980 word tokens). The new method is about 20 times faster than the original one.
|Metric|New Minipar2.java|Original Minipar.java|
|---|---|---|
|Total time (min)|3.90|77.77|
|Time per document (s)|8.37|166.66|
|Time per sentence (ms)|13.34|265.77|
|Time per token (ms)|0.68||
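The per-unit figures and the "about 20 times" claim follow directly from the total times and the corpus counts. The short sketch below redoes that arithmetic; the class and method names are made up for this example, and because it starts from the rounded totals in the table, the last digit can differ slightly from the table values.

```java
import java.util.Locale;

// Illustrative helper for re-deriving the table figures from the totals.
final class ParseTimings {
    /** Average time per unit in milliseconds, given a total in minutes. */
    static double perUnitMs(double totalMinutes, long units) {
        return totalMinutes * 60_000.0 / units;
    }

    /** How many times faster the new run is than the original one. */
    static double speedup(double originalMinutes, double newMinutes) {
        return originalMinutes / newMinutes;
    }

    public static void main(String[] args) {
        double newMinutes = 3.90;       // Minipar2.java (JNI proxy)
        double originalMinutes = 77.77; // Minipar.java (pdemo process per sentence)
        long documents = 28;
        long sentences = 17_558;
        long tokens = 341_980;

        System.out.printf(Locale.ROOT, "new, per document (s):  %.2f%n",
                perUnitMs(newMinutes, documents) / 1000.0);
        System.out.printf(Locale.ROOT, "new, per sentence (ms): %.2f%n",
                perUnitMs(newMinutes, sentences));
        System.out.printf(Locale.ROOT, "new, per token (ms):    %.2f%n",
                perUnitMs(newMinutes, tokens));
        System.out.printf(Locale.ROOT, "original, per sentence (ms): %.2f%n",
                perUnitMs(originalMinutes, sentences));
        System.out.printf(Locale.ROOT, "speedup: %.1fx%n",
                speedup(originalMinutes, newMinutes));
    }
}
```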