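Many people who work with documents in Java run into the same question: how do you get at the content of Word, Excel, PDF, and similar files? I looked into it, and this post summarizes several ways to extract text from Word and PDF documents.

1. Using Jacob

Jacob is a bridge: middleware that connects Java to COM or Win32 functions. Jacob cannot extract Word, Excel, or similar files by itself. It needs a native DLL, which you would otherwise have to write yourself, but Jacob's author already provides one.

Download the Jacob jar and DLL: http://danadler.com/jacob/

After downloading Jacob and putting the files in place (the DLL on the PATH, the jar on the classpath), you can write your own extraction program. Here is a simple example: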
import com.jacob.com.*;
import com.jacob.activeX.*;

/**
 * Title: PDF extraction
 * Description: email: chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: matrix.org.cn
 * @author chris
 * @version 1.0; if you use this example, please keep this notice
 */
public class FileExtracter {
    public static void main(String[] args) {
        // Start Word through COM; this requires Word to be installed on the machine.
        ActiveXComponent component = new ActiveXComponent("Word.Application");
        String inFile = "c:\\test.doc";
        String tpFile = "c:\\temp.htm";
        try {
            // Keep the Word window hidden.
            component.setProperty("Visible", new Variant(false));
            Dispatch wordAcc = component.getProperty("Documents").toDispatch();
            // Documents.Open(FileName, ConfirmConversions=false, ReadOnly=true)
            Dispatch wordFile = Dispatch.invoke(wordAcc, "Open", Dispatch.Method,
                    new Object[] { inFile, new Variant(false), new Variant(true) },
                    new int[1]).toDispatch();
            // SaveAs with format 8 (wdFormatHTML) writes the document out as HTML.
            Dispatch.invoke(wordFile, "SaveAs", Dispatch.Method,
                    new Object[] { tpFile, new Variant(8) }, new int[1]);
            // Close the document without saving changes.
            Dispatch.call(wordFile, "Close", new Variant(false));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Always shut Word down, even if the conversion failed.
            component.invoke("Quit", new Variant[] {});
        }
    }
}
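Before running an example like the one above, it can help to confirm that the jar and the DLL really are where the JVM looks for them. The following is a minimal sketch using only the standard library. `JacobSetupCheck`, `isOnClasspath`, and `isOnPath` are hypothetical names, not part of Jacob, and the DLL file name `jacob.dll` is an assumption; substitute the name of the file you actually downloaded. It also assumes that `java.library.path` reflects the directories from PATH, which is normally the case on Windows.

```java
import java.io.File;

/** Hypothetical helper (not part of Jacob): checks the two locations the setup depends on. */
class JacobSetupCheck {

    /** Returns true if the named class can be located on the classpath (without initializing it). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, JacobSetupCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns true if a file with the given name exists in any directory of the search path. */
    static boolean isOnPath(String searchPath, String fileName) {
        if (searchPath == null) {
            return false;
        }
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (!dir.isEmpty() && new File(dir, fileName).isFile()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // com.jacob.com.Dispatch is one of the classes the example above uses.
        System.out.println("Jacob jar on classpath: "
                + isOnClasspath("com.jacob.com.Dispatch"));
        // "jacob.dll" is an assumed file name; adjust it to match your download.
        System.out.println("DLL on java.library.path: "
                + isOnPath(System.getProperty("java.library.path"), "jacob.dll"));
    }
}
```

If either line prints `false`, the Jacob example will most likely fail at startup, so it is worth fixing the setup first.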
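2. Using Apache POI to extract Word and Excel

POI is an Apache project, but even POI can feel tedious to use directly. That is not a problem, because a simpler wrapper interface is available. Note that the example below imports its WordExtractor class from the org.textmining.text.extraction package.

Download the wrapped POI package: http://jakarta.apache.org/poi/

Once downloaded, put it on your classpath. Here is an example of how to use it: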
import java.io.*;
import org.textmining.text.extraction.WordExtractor;

/**
 * Title: Word extraction
 * Description: email: chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: matrix.org.cn
 * @author chris
 * @version 1.0; if you use this example, please keep this notice
 */
public class PdfExtractor {
    public PdfExtractor() {
    }

    public static void main(String args[]) throws Exception {
        // Despite the class name, this example extracts text from a Word file.
        FileInputStream in = new FileInputStream("c:\\a.doc");
        try {
            WordExtractor extractor = new WordExtractor();
            String str = extractor.extractText(in);
            System.out.println("the result length is " + str.length());
            System.out.println("the result is " + str);
        } finally {
            in.close();
        }
    }
}
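3. PDFBox, for extracting PDF files

PDFBox's support for Chinese is still poor, however. First download PDFBox: http://www.pdfbox.org/

Here is an example of using PDFBox to extract text from a PDF file: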
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.util.PDFTextStripper;
import java.io.*;

/**
 * Title: PDF extraction
 * Description: email: chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: matrix.org.cn
 * @author chris
 * @version 1.0; if you use this example, please keep this notice
 */
public class PdfExtracter {
    public PdfExtracter() {
    }

    public String getTextFromPdf(String filename) throws Exception {
        PDDocument pdfDocument = null;
        FileInputStream is = new FileInputStream(filename);
        try {
            // Parse the PDF and obtain the document object.
            PDFParser parser = new PDFParser(is);
            parser.parse();
            pdfDocument = parser.getPDDocument();

            // Write the stripped text into an in-memory buffer.
            // The same charset is named for encoding and decoding.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pdfDocument.getDocument(), writer);
            writer.close();

            byte[] contents = out.toByteArray();
            String ts = new String(contents, "UTF-8");
            System.out.println("the string length is " + contents.length + "\n");
            return ts;
        } finally {
            if (pdfDocument != null) {
                pdfDocument.close();
            }
            is.close();
        }
    }

    public static void main(String args[]) {
        PdfExtracter pf = new PdfExtracter();
        try {
            String ts = pf.getTextFromPdf("c:\\a.pdf");
            System.out.println(ts);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
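The example above sends the extracted text through a writer into a byte array and then turns the bytes back into a String. That round trip is only lossless if the charset involved can represent every character, which matters for Chinese text. The sketch below isolates just that step with the standard library, so it says nothing about PDFBox's own handling of Chinese fonts. `CharsetRoundTrip` and `roundTrip` are hypothetical names used for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: encode text to bytes and decode it again with one charset. */
class CharsetRoundTrip {

    /** Writes the text through a Writer into bytes, then decodes the bytes with the same charset. */
    static String roundTrip(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read.
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) {
        String sample = "PDF \u4e2d\u6587"; // "PDF" followed by two Chinese characters
        // UTF-8 can represent every character, so the text survives unchanged.
        System.out.println(sample.equals(roundTrip(sample, StandardCharsets.UTF_8)));
        // ISO-8859-1 cannot represent Chinese characters, so the text is altered.
        System.out.println(sample.equals(roundTrip(sample, StandardCharsets.ISO_8859_1)));
    }
}
```

Naming UTF-8 explicitly on both sides avoids depending on the platform default charset, which may not be able to encode Chinese characters.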
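4. Extracting PDF files that contain Chinese: Xpdf

Xpdf is an open-source project. We can invoke its native pdftotext program from Java to extract text from Chinese PDF files.

Download Xpdf: http://www.foolabs.com/xpdf/

You also need to download the Chinese language support package. Install the Chinese patch as described in its README, and then you can write a Java program that calls the native program. Here is an example of how to call it: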
import java.io.*;

/**
 * Title: PDF extraction
 * Description: email: chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: matrix.org.cn
 * @author chris
 * @version 1.0; if you use this example, please keep this notice
 */
public class PdfWin {
    public PdfWin() {
    }

    public static void main(String args[]) throws Exception {
        String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
        String filename = "c:\\a.pdf";
        // "-enc UTF-8" selects the output encoding, "-q" suppresses messages,
        // and the trailing "-" sends the extracted text to standard output.
        String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };
        Process p = Runtime.getRuntime().exec(cmd);
        BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
        InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
        StringWriter out = new StringWriter();
        char[] buf = new char[10000];
        int len;
        while ((len = reader.read(buf)) >= 0) {
            // Accumulate every chunk; buf is overwritten on each read.
            out.write(buf, 0, len);
            System.out.println("the length is " + len);
        }
        reader.close();
        String ts = out.toString();
        System.out.println("the str is " + ts);
    }
}
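The read loop is the part of the example above that is easiest to get wrong: every call to `read` overwrites the buffer, so the text has to be accumulated chunk by chunk rather than taken from the buffer after the loop ends. Below is a minimal sketch of that loop isolated as a helper, so it can be tried without Xpdf installed. `ReadAllSketch` and `readAll` are hypothetical names, and the `StringReader` stands in for the reader wrapped around the process output.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: drains a Reader into a String using a fixed-size buffer. */
class ReadAllSketch {

    /** Reads until end of stream, appending only the characters each read() actually returned. */
    static String readAll(Reader reader, int bufferSize) {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[bufferSize];
        try {
            int len;
            while ((len = reader.read(buf)) >= 0) {
                // Append exactly len characters; the rest of buf may hold stale data.
                text.append(buf, 0, len);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A deliberately tiny buffer forces several iterations of the loop.
        String result = readAll(new StringReader("text longer than the buffer"), 4);
        System.out.println(result);
    }
}
```

With a real process, the same helper could be given the `InputStreamReader` created from `p.getInputStream()`.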