服务热线:13616026886

技术文档 欢迎使用技术文档,我们为你提供从新手到专业开发者的所有资源,你也可以通过它日益精进

位置:首页 > 技术文档 > JAVA > 新手入门 > 基础入门 > 查看文档

lucene按字段排序的算法

FieldCacheImpl的作用,用于将结果按字段排序 (sort)的算法

在lucene里,除了默认的排序算法以外,它支持按某几个字段排序,类似数据库的“order by“。下面介绍一下它的原理。
在lucene里,做索引的term是先按fieldname,再按fieldvalue排序的,这样的话同一个field的term是连续的,类似于上图中的上半部分,F1是fieldname,V1,V2,V3是fieldvalue,0~10是DocID,这里总共11个Document。若要将检索结果按F1排序的话,lucene v1.4.3会将所有F1的value加载到内存中去,形成如上图下半部分,它是一个数组(该数组存储在FieldCacheImpl中),下标是DocID,数组元素内容是F1的值。加载之后检索的过程中就根据DocID去该数组里找相应的值来计算每个document的得分。示列代码:
    // 按ID字段来排序
    Searcher searcher = new IndexSearcher(indexDir);
    Analyzer analyzer = new ChineseAnalyzer();   
    Query query = QueryParser.parse(keyword, "contents", analyzer);
    Sort sort = new Sort(new SortField("ID",SortField.STRING));
    hits = searcher.search(query,sort);
Lucene V1.4.3支持整型,浮点,字符串字段的排序,而且这些字段不能被分词“untokenized”。
下面的英文介绍肯定有帮助^_^。

/**
 * Encapsulates sort criteria for returned hits.
 *
 * <p>The fields used to determine sort order must be carefully chosen.
 * Documents must contain a single term in such a field,
 * and the value of the term should indicate the document's relative position in
 * a given sort order.  The field must be indexed, but should not be tokenized,
 * and does not need to be stored (unless you happen to want it back with the
 * rest of your document data).  In other words:
 *
 * <dl><dd><code>document.add (new Field ("byNumber", Integer.toString(x), false, true, false));</code>
 * </dd></dl>
 *
 * <p><h3>Valid Types of Values</h3>
 *
 * <p>There are three possible kinds of term values which may be put into
 * sorting fields: Integers, Floats, or Strings.  Unless
 * {@link SortField SortField} objects are specified, the type of value
 * in the field is determined by parsing the first term in the field.
 *
 * <p>Integer term values should contain only digits and an optional
 * preceeding negative sign.  Values must be base 10 and in the range
 * <code>Integer.MIN_VALUE</code> and <code>Integer.MAX_VALUE</code> inclusive.
 * Documents which should appear first in the sort
 * should have low value integers, later documents high values
 * (i.e. the documents should be numbered <code>1..n</code> where
 * <code>1</code> is the first and <code>n</code> the last).
 *
 * <p>Float term values should conform to values accepted by
 * {@link Float Float.valueOf(String)} (except that <code>NaN</code>
 * and <code>Infinity</code> are not supported).
 * Documents which should appear first in the sort
 * should have low values, later documents high values.
 *
 * <p>String term values can contain any valid String, but should
 * not be tokenized.  The values are sorted according to their
 * {@link Comparable natural order}.  Note that using this type
 * of term value has higher memory requirements than the other
 * two types.
 *
 * <p><h3>Object Reuse</h3>
 *
 * <p>One of these objects can be
 * used multiple times and the sort order changed between usages.
 *
 * <p>This class is thread safe.
 *
 * <p><h3>Memory Usage</h3>
 *
 * <p>Sorting uses of caches of term values maintained by the
 * internal HitQueue(s).  The cache is static and contains an integer
 * or float array of length <code>IndexReader.maxDoc()</code> for each field
 * name for which a sort is performed.  In other words, the size of the
 * cache in bytes is:
 *
 * <p><code>4 * IndexReader.maxDoc() * (# of different fields actually used to sort)</code>
 *
 * <p>For String fields, the cache is larger: in addition to the
 * above array, the value of every term in the field is kept in memory.
 * If there are many unique terms in the field, this could
 * be quite large.
 *
 * <p>Note that the size of the cache is not affected by how many
 * fields are in the index and <i>might</i> be used to sort - only by
 * the ones actually used to sort a result set.
 *
 * <p>The cache is cleared each time a new <code>IndexReader</code> is
 * passed in, or if the value returned by <code>maxDoc()</code>
 * changes for the current IndexReader.  This class is not set up to
 * be able to efficiently sort hits from more than one index
 * simultaneously.*/

 

扫描关注微信公众号