Four Weapons for Extracting Word and PDF Files in Java

Views: 1206 2007-11-16

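When working with documents in Java, many people run into the same problem: how do you get at the content of Word, Excel, PDF and similar files? I looked into it, and here I summarize several ways to extract text from Word and PDF files.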

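1. Using Jacob

Jacob is really a bridge: a piece of middleware that connects Java to COM or Win32 functions. Jacob cannot extract Word, Excel or other files by itself; it needs a native DLL to do the work. You do not have to write that DLL yourself, though, because Jacob's author ships one with the library. Since the example below drives Word through COM, it only runs on Windows with Word installed.

Download the Jacob jar and DLL files: http://danadler.com/jacob/

Once you have downloaded Jacob and put the files in place (the DLL on the PATH, the jar on the classpath), you can write your own extraction program. Here is a simple example: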

import com.jacob.com.*;
import com.jacob.activeX.*;
/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: matrix copyright (c) 2003
 * Company: matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */
public class FileExtracter {
    public static void main(String[] args) {
        // Start (or attach to) Word through COM
        ActiveXComponent component = new ActiveXComponent("Word.Application");
        String inFile = "c:\\test.doc";
        String tpFile = "c:\\temp.htm";
        String otFile = "c:\\temp.xml";
        boolean flag = false;
        try {
            // Keep the Word window hidden
            component.setProperty("Visible", new Variant(false));
            Dispatch wordAcc = component.getProperty("Documents").toDispatch();
            // Documents.Open(FileName, ConfirmConversions=false, ReadOnly=true)
            Dispatch wordFile = Dispatch.invoke(wordAcc, "Open", Dispatch.Method,
                    new Object[]{inFile, new Variant(false), new Variant(true)},
                    new int[1]).toDispatch();
            // SaveAs with format 8 (wdFormatHTML) writes the document as HTML
            Dispatch.invoke(wordFile, "SaveAs", Dispatch.Method, new Object[]{
                    tpFile, new Variant(8)}, new int[1]);
            // Close without saving changes to the original document
            Variant f = new Variant(false);
            Dispatch.call(wordFile, "Close", f);
            flag = true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Always shut Word down, even if the conversion failed
            component.invoke("Quit", new Variant[] {});
        }
    }
}
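
The Jacob example stops once Word has saved the document as HTML in c:\temp.htm, so one more step is needed to get plain text. The following is only a rough sketch of that step, not part of the original program: the class name HtmlTextSketch and the method stripTags are made up for illustration, the regular expressions are a crude approximation rather than a real HTML parser, and the file is decoded with the platform default charset unless another one is passed, which is an assumption about how Word wrote the file.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: turns the HTML saved by Word into rough plain text.
class HtmlTextSketch {

    // Drops <script>/<style> blocks and comments, then every remaining tag,
    // decodes a few common entities and collapses runs of whitespace.
    static String stripTags(String html) {
        String s = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        s = s.replaceAll("(?s)<!--.*?-->", " ");
        s = s.replaceAll("(?s)<[^>]+>", " ");
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&amp;", "&");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // Default path matches tpFile in the Jacob example above.
        String path = args.length > 0 ? args[0] : "c:\\temp.htm";
        // Assumption: the file uses the platform default charset unless told otherwise.
        Charset cs = args.length > 1 ? Charset.forName(args[1]) : Charset.defaultCharset();
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        System.out.println(stripTags(new String(bytes, cs)));
    }
}
```

Run it after FileExtracter has finished, optionally passing the file path and a charset name as arguments. For anything beyond a quick experiment a proper HTML parser would be a safer choice than regular expressions.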

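2. Using Apache POI to extract Word and Excel

POI is an Apache project, but even with POI you may find the work tedious. No matter: here is a simpler interface that wraps it for you:

Download the wrapped POI package: http://jakarta.apache.org/poi/

After downloading, put it on your classpath. Here is an example of how to use it: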

import java.io.*;
import org.textmining.text.extraction.WordExtractor;
/**
 * Title: word extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: matrix copyright (c) 2003
 * Company: matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */

public class PdfExtractor {
    public PdfExtractor() {
    }
    public static void main(String args[]) throws Exception
    {
        // Despite the class name, this example reads a Word document
        FileInputStream in = new FileInputStream("c:\\a.doc");
        try {
            WordExtractor extractor = new WordExtractor();
            String str = extractor.extractText(in);
            System.out.println("the result length is " + str.length());
            System.out.println("the result is " + str);
        } finally {
            // Release the file handle even if extraction fails
            in.close();
        }
    }
}

3. PDFBox, for extracting PDF files

PDFBox's support for Chinese is still poor, however. First download PDFBox: http://www.pdfbox.org/

Here is an example of how to extract a PDF file with PDFBox:

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdfparser.PDFParser;
import java.io.*;
import org.pdfbox.util.PDFTextStripper;
/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: matrix copyright (c) 2003
 * Company: matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */

public class PdfExtracter {

    public PdfExtracter() {
    }
    public String getTextFromPdf(String filename) throws Exception
    {
        PDDocument pdfDocument = null;
        FileInputStream is = new FileInputStream(filename);
        try {
            // Parse the PDF file into a document object
            PDFParser parser = new PDFParser(is);
            parser.parse();
            pdfDocument = parser.getPDDocument();
            // Collect the extracted text in memory
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            OutputStreamWriter writer = new OutputStreamWriter(out);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pdfDocument.getDocument(), writer);
            writer.close();
            byte[] contents = out.toByteArray();

            // Encoded and decoded with the same (platform default) charset
            String ts = new String(contents);
            System.out.println("the string length is " + contents.length + "\n");
            return ts;
        } finally {
            // Release the document and the file handle
            if (pdfDocument != null) {
                pdfDocument.close();
            }
            is.close();
        }
    }
    public static void main(String args[])
    {
        PdfExtracter pf = new PdfExtracter();

        try {
            String ts = pf.getTextFromPdf("c:\\a.pdf");
            System.out.println(ts);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

}
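
The PDFBox example captures the stripper's output as bytes and turns them back into a String, relying on the platform default charset in both directions. The sketch below shows two ways to make that capture independent of the platform charset. It deliberately does not call PDFBox: writeSample is a hypothetical stand-in for the stripper.writeText(..., writer) call, and the class and method names are invented for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class WriterCaptureSketch {

    // Hypothetical stand-in for stripper.writeText(document, writer).
    // \u6587\u672c is the Chinese word for "text", escaped so the source
    // compiles the same way regardless of the file encoding.
    static void writeSample(Writer writer) throws IOException {
        writer.write("PDF \u6587\u672c sample");
    }

    // Variant 1: keep the byte round trip, but name the charset on both sides.
    static String viaBytes() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writeSample(writer);
        writer.close(); // flushes buffered characters into 'out'
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Variant 2: skip bytes entirely and collect characters in a StringWriter.
    static String viaStringWriter() throws IOException {
        StringWriter writer = new StringWriter();
        writeSample(writer);
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(viaBytes().equals(viaStringWriter()));
    }
}
```

The second variant is usually the simpler one when the text is only needed as a String. Neither variant changes how well PDFBox itself handles Chinese fonts; it only rules out the capture step as a source of garbled characters.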

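4. Extracting PDF files with Chinese support: xpdf

xpdf is an open source project, and we can call its native command-line program to extract Chinese PDF files.

Download the xpdf package: http://www.foolabs.com/xpdf/

You also need to download the Chinese language support package. Once you have installed the Chinese patch as described in its README, you can start writing the Java program that calls the native program.

Here is an example of how to call it: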

import java.io.*;
/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: matrix copyright (c) 2003
 * Company: matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */


public class PdfWin {
    public PdfWin() {
    }
    public static void main(String args[]) throws Exception
    {
        String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
        String filename = "c:\\a.pdf";
        // "-enc UTF-8" selects the output encoding, "-q" suppresses messages,
        // and the trailing "-" sends the text to standard output
        String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
        Process p = Runtime.getRuntime().exec(cmd);
        BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
        InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
        StringWriter out = new StringWriter();
        char[] buf = new char[10000];
        int len;
        while ((len = reader.read(buf)) >= 0) {
            // Accumulate every chunk; the buffer alone only holds the last read
            out.write(buf, 0, len);
            System.out.println("the length is " + len);
        }
        reader.close();
        String ts = out.toString();
        System.out.println("the str is " + ts);
    }
}
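
The same call can be wrapped in a reusable method. The following is a minimal sketch under a few assumptions: the class PdfToTextSketch and its methods are invented names, it uses ProcessBuilder instead of Runtime.exec, and it treats any non-zero exit code from pdftotext as a failure. The command-line arguments are the ones used in the example above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class PdfToTextSketch {

    // Reads the stream to the end and decodes all bytes at once as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int len;
        while ((len = in.read(buf)) >= 0) {
            collected.write(buf, 0, len);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    // Runs pdftotext on one file and returns the text it prints to stdout.
    static String extract(String pdftotextPath, String pdfFile)
            throws IOException, InterruptedException {
        List<String> cmd = Arrays.asList(
                pdftotextPath, "-enc", "UTF-8", "-q", pdfFile, "-");
        ProcessBuilder pb = new ProcessBuilder(cmd);
        // Pass the child's stderr through so it cannot fill up and block.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        String text;
        try (InputStream in = p.getInputStream()) {
            text = readAll(in);
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // Paths are the same illustrative ones used in the example above.
        System.out.println(
                extract("C:\\Program Files\\xpdf\\pdftotext.exe", "c:\\a.pdf"));
    }
}
```

Reading stdout to the end before calling waitFor matters here: if the program waited first, a large PDF could fill the pipe buffer and leave both processes stuck.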
