Four Weapons for Extracting Word and PDF Files in Java

Views: 967 | 2007-11-17

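When working with documents in Java, a common question is how to get at the content of Word, Excel, PDF and similar files. After some research, here is a summary of several ways to extract Word and PDF content.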
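1. Using JACOB
JACOB is a bridge: middleware that connects Java to COM or Win32 functions. JACOB cannot extract Word, Excel and similar files by itself; the Java side has to go through a native DLL. You do not have to write that DLL yourself, though, because JACOB's author ships one with the library.
Download the JACOB jar and DLL: http://danadler.com/jacob/
Once JACOB is downloaded and put in place (DLL on the PATH, jar on the classpath), you can write your own extraction program. Here is a simple example, which opens the document in Word through COM and saves it as HTML: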
import java.io.File;
import com.jacob.com.*;
import com.jacob.activeX.*;

/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */
public class FileExtracter {
    public static void main(String[] args) {
        // Start Word through COM; Word must be installed on this machine
        ActiveXComponent component = new ActiveXComponent("Word.Application");
        String inFile = "c:\\test.doc";
        String tpFile = "c:\\temp.htm";
        String otFile = "c:\\temp.xml";
        boolean flag = false;
        try {
            component.setProperty("Visible", new Variant(false));
            Dispatch wordacc = component.getProperty("Documents").toDispatch();
            // Documents.Open(FileName, ConfirmConversions = false, ReadOnly = true)
            Dispatch wordfile = Dispatch.invoke(wordacc, "Open", Dispatch.Method,
                    new Object[]{inFile, new Variant(false), new Variant(true)},
                    new int[1]).toDispatch();
            // SaveAs(FileName, FileFormat); 8 is wdFormatHTML
            Dispatch.invoke(wordfile, "SaveAs", Dispatch.Method,
                    new Object[]{tpFile, new Variant(8)}, new int[1]);
            Variant f = new Variant(false);
            Dispatch.call(wordfile, "Close", f); // close without saving changes
            flag = true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Always shut Word down, even if the conversion failed
            component.invoke("Quit", new Variant[]{});
        }
    }
}

2. Extracting Word and Excel with Apache POI
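POI is an Apache project. Even POI's own API can feel tedious to use, but that does not matter, because a simpler interface wrapped around it is available:
Download the wrapped POI package: http://jakarta.apache.org/poi/
After downloading, put it on your classpath. Here is an example of how to use it: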
import java.io.*;
import org.textmining.text.extraction.WordExtractor;

/**
 * Title: word extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */
public class PdfExtractor {
    public PdfExtractor() {
    }

    public static void main(String args[]) throws Exception {
        FileInputStream in = new FileInputStream("c:\\a.doc");
        WordExtractor extractor = new WordExtractor();
        // extractText returns the plain text of the whole document
        String str = extractor.extractText(in);
        System.out.println("the result length is" + str.length());
        System.out.println("the result is" + str);
    }
}
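
Before handing a stream to WordExtractor, it can help to confirm that the file really is a binary Office document. .doc and .xls files are stored in the OLE2 compound document format, which starts with a fixed 8-byte signature, while PDF files start with "%PDF-". The sketch below checks those leading bytes. FormatSniffer and sniff are hypothetical names introduced here for illustration; they are not part of POI or of any library mentioned in this article.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FormatSniffer {
    // Signature at the start of every OLE2 compound document (.doc, .xls, .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // PDF files begin with the ASCII text "%PDF-"
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};

    /**
     * Reads up to 8 bytes and returns "ole2", "pdf" or "unknown".
     * The bytes are consumed, so reopen the file before extracting.
     */
    public static String sniff(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // file is shorter than 8 bytes
            }
            n += r;
        }
        if (n == OLE2_MAGIC.length && Arrays.equals(head, OLE2_MAGIC)) {
            return "ole2";
        }
        if (n >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC)) {
            return "pdf";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        byte[] fakePdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(fakePdf)));
    }
}
```

A result of "ole2" only says that the container is right, not that the file is a Word document rather than an Excel one. Newer .docx files are ZIP archives and come back as "unknown".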

3. PDFBox, for extracting PDF files
PDFBox's support for Chinese is still poor, however. First download PDFBox: http://www.pdfbox.org/
Here is an example of extracting a PDF file with PDFBox:

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdfparser.PDFParser;
import java.io.*;
import org.pdfbox.util.PDFTextStripper;
import java.util.Date;

/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */
public class PdfExtracter {

    public PdfExtracter() {
    }

    public String getTextFromPdf(String filename) throws Exception {
        String temp = null;
        PDDocument pdfdocument = null;
        FileInputStream is = new FileInputStream(filename);
        // Parse the PDF into an in-memory document
        PDFParser parser = new PDFParser(is);
        parser.parse();
        pdfdocument = parser.getPDDocument();
        // Write the extracted text into a byte buffer
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        OutputStreamWriter writer = new OutputStreamWriter(out);
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.writeText(pdfdocument.getDocument(), writer);
        writer.close();
        byte[] contents = out.toByteArray();

        String ts = new String(contents);
        System.out.println("the string length is" + contents.length + "\n");
        return ts;
    }

    public static void main(String args[]) {
        PdfExtracter pf = new PdfExtracter();
        PDDocument pdfdocument = null;

        try {
            String ts = pf.getTextFromPdf("c:\\a.pdf");
            System.out.println(ts);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}
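
One detail in the example above is worth a closer look: both OutputStreamWriter and new String(byte[]) fall back to the platform's default charset when none is given. On a machine whose default charset cannot represent Chinese characters, text is lost on the Java side regardless of what the PDF library produced. The sketch below uses only the JDK and a plain string in place of the stripper output to show the difference. CharsetRoundTrip and roundTrip are illustrative names, and this does not address PDFBox's own handling of Chinese fonts.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    /** Writes text through a Writer into bytes, then decodes the bytes with the same charset. */
    public static String roundTrip(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text); // stands in for the text stripper writing into the writer
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        String chinese = "\u4e2d\u6587"; // the two characters of the word "Chinese"
        // UTF-8 can encode every character, so the text survives
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));
        // ISO-8859-1 cannot encode these characters; each one is replaced by '?'
        System.out.println(roundTrip(chinese, StandardCharsets.ISO_8859_1));
    }
}
```

Passing an explicit charset to both the writer and the String constructor keeps the result independent of the machine it runs on. Writing into a StringWriter instead of a byte buffer avoids the encode and decode step altogether.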

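4. Extracting PDF files with Chinese text: xpdf
xpdf is an open-source project. We can invoke its native command-line program from Java to extract text from Chinese PDF files.
Download xpdf: http://www.foolabs.com/xpdf/
You also need to download the Chinese support patch package. Install the patch as described in its README, and then you can start writing the Java program that calls the native program.
Here is an example of the call: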

import java.io.*;

/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */
public class PdfWin {
    public PdfWin() {
    }

    public static void main(String args[]) throws Exception {
        String PATH_TO_XPDF = "c:\\program files\\xpdf\\pdftotext.exe";
        String filename = "c:\\a.pdf";
        // -enc UTF-8: output encoding; -q: suppress messages; "-": write the text to stdout
        String[] cmd = new String[]{PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
        Process p = Runtime.getRuntime().exec(cmd);
        BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
        InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
        StringWriter out = new StringWriter();
        char[] buf = new char[10000];
        int len;
        while ((len = reader.read(buf)) >= 0) {
            // Accumulate every chunk; the buffer alone only holds the last read
            out.write(buf, 0, len);
            System.out.println("the length is" + len);
        }
        reader.close();
        String ts = out.toString();
        System.out.println("the str is" + ts);
    }
}
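
The pdftotext call can also be split into small pieces that are easy to check on their own. The sketch below is one possible arrangement using ProcessBuilder from the JDK; it is not part of xpdf or of the original program. PdfToTextRunner, buildCommand, readAll and extract are names made up here, and the path to pdftotext is supplied by the caller. The command-line options are the same ones used in the example above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class PdfToTextRunner {

    /** Argument list: UTF-8 output, quiet mode, and "-" to send the text to stdout. */
    public static List<String> buildCommand(String pdftotextPath, String pdfFile) {
        return Arrays.asList(pdftotextPath, "-enc", "UTF-8", "-q", pdfFile, "-");
    }

    /** Reads a stream to the end as UTF-8, appending every chunk that arrives. */
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int len;
            while ((len = reader.read(buf)) >= 0) {
                sb.append(buf, 0, len);
            }
        }
        return sb.toString();
    }

    /** Runs pdftotext and returns its output; fails if the exit code is not zero. */
    public static String extract(String pdftotextPath, String pdfFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdftotextPath, pdfFile));
        // Send stderr to this process's stderr so an unread pipe cannot stall the child
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        String text = readAll(p.getInputStream());
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: PdfToTextRunner <path-to-pdftotext> <file.pdf>");
            return;
        }
        System.out.println(extract(args[0], args[1]));
    }
}
```

Checking the exit code makes a failed run visible, where reading stdout alone would quietly return an empty string.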
