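What FieldCacheImpl does: the algorithm for sorting results by field
Besides its default ordering by relevance, Lucene can sort results by one or more fields, much like "order by" in a database. The principle is described below.
In Lucene, indexed terms are sorted first by field name and then by field value, so all terms of the same field are contiguous, as in the upper half of the figure above: F1 is the field name, V1, V2, and V3 are field values, and 0~10 are DocIDs, for 11 documents in total. To sort search results by F1, Lucene v1.4.3 loads every F1 value into memory, producing the structure in the lower half of the figure: an array (stored in FieldCacheImpl) whose index is the DocID and whose elements are the F1 values. Once the array is loaded, the search looks up each matching DocID in it and uses the value found there as that document's sort key. Sample code: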
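import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

// Sort by the ID field (indexDir and keyword are defined elsewhere)
Searcher searcher = new IndexSearcher(indexDir);
Analyzer analyzer = new ChineseAnalyzer();
Query query = QueryParser.parse(keyword, "contents", analyzer);
Sort sort = new Sort(new SortField("ID", SortField.STRING));
Hits hits = searcher.search(query, sort);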
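Lucene v1.4.3 supports sorting on integer, float, and string fields, and these fields must be indexed untokenized.
The English documentation below (the Javadoc of the Sort class) should help ^_^.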
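The DocID-indexed array described above can be sketched without Lucene itself. The following is only an illustration under assumptions: the class and method names (FieldCacheSketch, buildCache, sortHits) are hypothetical and are not Lucene's API, and the assignment of V1, V2, V3 to the 11 documents is made up, since the figure is not reproduced here.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class FieldCacheSketch {

    // Walk the terms of one field (value -> DocIDs containing it) and fill
    // an array whose index is the DocID and whose element is the field value.
    static String[] buildCache(SortedMap<String, int[]> termToDocs, int maxDoc) {
        String[] cache = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : termToDocs.entrySet()) {
            for (int docId : entry.getValue()) {
                cache[docId] = entry.getKey();
            }
        }
        return cache;
    }

    // Order the matching DocIDs by their cached value; ties fall back to DocID.
    static int[] sortHits(int[] hitDocIds, String[] cache) {
        Integer[] boxed = new Integer[hitDocIds.length];
        for (int i = 0; i < hitDocIds.length; i++) {
            boxed[i] = hitDocIds[i];
        }
        Arrays.sort(boxed, (a, b) -> {
            int byValue = cache[a.intValue()].compareTo(cache[b.intValue()]);
            return byValue != 0 ? byValue : a.compareTo(b);
        });
        int[] sorted = new int[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            sorted[i] = boxed[i];
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical postings for field F1 over DocIDs 0..10.
        SortedMap<String, int[]> f1 = new TreeMap<>();
        f1.put("V1", new int[] {0, 3, 7});
        f1.put("V2", new int[] {1, 4, 5, 9});
        f1.put("V3", new int[] {2, 6, 8, 10});

        String[] cache = buildCache(f1, 11);
        int[] hits = {10, 4, 0, 7, 2};
        System.out.println(Arrays.toString(sortHits(hits, cache)));
        // prints [0, 7, 4, 2, 10]
    }
}
```

The sketch also suggests why the sort field should hold a single untokenized term per document: with several terms per document, later terms would simply overwrite the array slot for that DocID.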
/**
* Encapsulates sort criteria for returned hits.
*
* <p>The fields used to determine sort order must be carefully chosen.
* Documents must contain a single term in such a field,
* and the value of the term should indicate the document's relative position in
* a given sort order. The field must be indexed, but should not be tokenized,
* and does not need to be stored (unless you happen to want it back with the
* rest of your document data). In other words:
*
* <dl><dd><code>document.add (new Field ("byNumber", Integer.toString(x), false, true, false));</code>
* </dd></dl>
*
* <p><h3>Valid Types of Values</h3>
*
* <p>There are three possible kinds of term values which may be put into
* sorting fields: Integers, Floats, or Strings. Unless
* {@link SortField SortField} objects are specified, the type of value
* in the field is determined by parsing the first term in the field.
*
* <p>Integer term values should contain only digits and an optional
* preceding negative sign. Values must be base 10 and in the range
* <code>Integer.MIN_VALUE</code> and <code>Integer.MAX_VALUE</code> inclusive.
* Documents which should appear first in the sort
* should have low value integers, later documents high values
* (i.e. the documents should be numbered <code>1..n</code> where
* <code>1</code> is the first and <code>n</code> the last).
*
* <p>Float term values should conform to values accepted by
* {@link Float Float.valueOf(String)} (except that <code>NaN</code>
* and <code>Infinity</code> are not supported).
* Documents which should appear first in the sort
* should have low values, later documents high values.
*
* <p>String term values can contain any valid String, but should
* not be tokenized. The values are sorted according to their
* {@link Comparable natural order}. Note that using this type
* of term value has higher memory requirements than the other
* two types.
*
* <p><h3>Object Reuse</h3>
*
* <p>One of these objects can be
* used multiple times and the sort order changed between usages.
*
* <p>This class is thread safe.
*
* <p><h3>Memory Usage</h3>
*
* <p>Sorting uses caches of term values maintained by the
* internal HitQueue(s). The cache is static and contains an integer
* or float array of length <code>IndexReader.maxDoc()</code> for each field
* name for which a sort is performed. In other words, the size of the
* cache in bytes is:
*
* <p><code>4 * IndexReader.maxDoc() * (# of different fields actually used to sort)</code>
*
* <p>For String fields, the cache is larger: in addition to the
* above array, the value of every term in the field is kept in memory.
* If there are many unique terms in the field, this could
* be quite large.
*
* <p>Note that the size of the cache is not affected by how many
* fields are in the index and <i>might</i> be used to sort - only by
* the ones actually used to sort a result set.
*
* <p>The cache is cleared each time a new <code>IndexReader</code> is
* passed in, or if the value returned by <code>maxDoc()</code>
* changes for the current IndexReader. This class is not set up to
* be able to efficiently sort hits from more than one index
* simultaneously.*/
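To get a feel for the memory formula quoted in the Javadoc above, here is a small worked calculation. The class name SortCacheSize and the index size used are hypothetical; the formula covers only the integer or float arrays, so String fields would need additional memory for the term values themselves.

```java
public class SortCacheSize {

    // 4 bytes per document per sort field, as in the quoted formula:
    // 4 * IndexReader.maxDoc() * (# of different fields actually used to sort)
    static long cacheBytes(int maxDoc, int sortFieldCount) {
        return 4L * maxDoc * sortFieldCount;
    }

    public static void main(String[] args) {
        // Hypothetical index: one million documents, sorted on two fields.
        System.out.println(cacheBytes(1_000_000, 2)); // prints 8000000
    }
}
```

Under these assumed numbers the cache comes to roughly 8 MB, regardless of how many other fields the index contains.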
