Adding Your Own Chinese Word-Segmentation Analyzer to DotLucene/Lucene.Net

 
A very simple, though not especially optimized, approach: subclass Lucene.Net.Analysis.Analyzer, Lucene.Net.Analysis.Tokenizer and Lucene.Net.Analysis.TokenFilter. The implementation is modeled on Lucene.Net.Analysis.Cn, which segments Chinese text into single characters (unigram segmentation).
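The actual segmentation is delegated to the ShootSeg component. Before the Lucene classes, here is a minimal sketch of the contract the code below assumes from it: the dictionaries are loaded once, a separator character is configured, and SegmentText returns the segmented words joined by that separator. The sample text and the output shown are illustrative only; actual results depend on ShootSeg's dictionary.

using System;
using ShootSeg; // the segmentation component referenced below

class SegmentSketch
{
    static void Main()
    {
        // Assumed ShootSeg usage, inferred from the analyzer/tokenizer code that follows.
        Segment segment = new Segment();
        segment.InitWordDics();   // load the word dictionaries
        segment.Separator = "|";  // segmented words are joined with this character
        string segmented = segment.SegmentText("全文检索引擎");
        Console.WriteLine(segmented); // e.g. "全文|检索|引擎" (output depends on the dictionary)
    }
}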

The ChineseAnalyzer class derives from Lucene.Net.Analysis.Analyzer:

using System;
using System.IO;
using System.Text;
using System.Collections;
using ShootSeg; // namespace of the segmentation component, an open-source project from http://www.shootsoft.net (thanks to the author)
using Lucene.Net.Analysis;

namespace Lucene.Net.Analysis.CnByKing
{
    public class ChineseAnalyzer : Analyzer
    {
        private Segment segment = new Segment(); // the Chinese word-segmentation component

        public ChineseAnalyzer()
        {
            segment.InitWordDics();  // load the word dictionaries in the constructor
            segment.Separator = "|"; // separator inserted between segmented words
        }

        public override sealed TokenStream TokenStream(String fieldName, TextReader reader)
        {
            TokenStream result = new ChineseTokenizer(reader, segment); // pass the segmenter into the tokenizer
            result = new ChineseFilter(result);                         // filter the tokenizer's output
            return result;
        }
    }
}

The ChineseTokenizer class derives from Lucene.Net.Analysis.Tokenizer:

using System;
using System.IO;
using System.Text;
using System.Collections;
using System.Globalization;
using ShootSeg;
using Lucene.Net.Analysis;

namespace Lucene.Net.Analysis.CnByKing
{
    public sealed class ChineseTokenizer : Tokenizer
    {
        private Segment segment;
        private string[] Wordlist;  // the segmented words
        private string Allstr;      // the whole input stream read into one string
        private int offset = 0;     // search offset within Allstr
        private int start = 0;      // start position of the current word
        private int step = 0;       // index of the next word in Wordlist

        public ChineseTokenizer(TextReader _in, Segment segment)
        {
            input = _in;
            Allstr = input.ReadToEnd();  // read the whole stream into Allstr
            this.segment = segment;      // reuse the analyzer's segmenter instance
            Wordlist = segment.SegmentText(Allstr).Split('|'); // segment and split into the word array
        }

        private Token Flush(string str)
        {
            if (str.Length > 0)
            {
                // Return a Token holding the word plus its start and end offsets in the stream.
                return new Token(str, start, start + str.Length);
            }
            else
                return null;
        }

        public override Token Next() // return the next Token, or null when the stream is exhausted
        {
            Token token = null;
            if (step < Wordlist.Length) // note: must be <, not <=, or the last call overruns the array
            {
                start = Allstr.IndexOf(Wordlist[step], offset); // locate the word in Allstr
                offset = start + 1;                             // advance the search offset
                token = Flush(Wordlist[step]);                  // build the Token
                step = step + 1;                                // move on to the next word
            }
            return token;
        }
    }
}
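To sanity-check the offset bookkeeping in Next(), a small harness like the following can dump every token together with its start and end offsets. This is only a sketch: it assumes the ShootSeg dictionaries are available and uses the Lucene.Net 1.9-era Token methods TermText(), StartOffset() and EndOffset(); the sample text is arbitrary.

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CnByKing;
using ShootSeg;

class TokenizerDemo
{
    static void Main()
    {
        // Set up the segmenter exactly as ChineseAnalyzer does.
        Segment segment = new Segment();
        segment.InitWordDics();
        segment.Separator = "|";

        // Feed a sample string through the tokenizer and print each token with its offsets.
        Tokenizer tokenizer = new ChineseTokenizer(new StringReader("Lucene.Net 中文分词测试"), segment);
        for (Token token = tokenizer.Next(); token != null; token = tokenizer.Next())
        {
            Console.WriteLine("{0} [{1},{2}]", token.TermText(), token.StartOffset(), token.EndOffset());
        }
        tokenizer.Close();
    }
}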

The ChineseFilter class derives from Lucene.Net.Analysis.TokenFilter and is copied verbatim from the class of the same name in the Lucene.Net.Analysis.Cn project (it drops digits and symbols as well as common English stop words; to filter anything else, add the corresponding code, as in the sketch after the class below).
using System;
using System.IO;
using System.Collections;
using System.Globalization;
using Lucene.Net.Analysis;
namespace Lucene.Net.Analysis.CnByKing
{
/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2004 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

/// <summary>
/// Title: ChineseFilter
/// Description: Filter with a stop word table
/// Rule: No digital is allowed.
/// English word/token should larger than 1 character.
/// One Chinese character as one Chinese word.
/// TO DO:
/// 1. Add Chinese stop words, such as /ue400
/// 2. Dictionary based Chinese word extraction
/// 3. Intelligent Chinese word extraction
///
/// Copyright: Copyright (c) 2001
/// Company:
/// @author Yiyi Sun
/// @version $Id: ChineseFilter.java, v 1.4 2003/01/23 12:49:33 ehatcher Exp $
/// </summary>
    public sealed class ChineseFilter : TokenFilter
    {
        // Only English now, Chinese to be added later.
        public static String[] STOP_WORDS =
        {
            "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it",
            "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"
        };

        private Hashtable stopTable;

        public ChineseFilter(TokenStream _in)
            : base(_in)
        {
            stopTable = new Hashtable(STOP_WORDS.Length);

            for (int i = 0; i < STOP_WORDS.Length; i++)
                stopTable[STOP_WORDS[i]] = STOP_WORDS[i];
        }

        public override Token Next()
        {
            for (Token token = input.Next(); token != null; token = input.Next())
            {
                String text = token.TermText();

                // why not key off token type here assuming ChineseTokenizer comes first?
                if (stopTable[text] == null)
                {
                    switch (Char.GetUnicodeCategory(text[0]))
                    {
                        case UnicodeCategory.LowercaseLetter:
                        case UnicodeCategory.UppercaseLetter:
                            // English word/token should larger than 1 character.
                            if (text.Length > 1)
                            {
                                return token;
                            }
                            break;

                        case UnicodeCategory.OtherLetter:
                            // One Chinese character as one Chinese word.
                            // Chinese word extraction to be added later here.
                            return token;
                    }
                }
            }
            return null;
        }
    }
}
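As noted above, this filter only knows English stop words. One possible way to extend it, following the TODO in its header comment, is to chain an extra stop-word filter behind it inside ChineseAnalyzer.TokenStream. The following is an illustrative sketch; the class name and the Chinese stop words listed are examples only, not part of the original project.

using System;
using System.Collections;
using Lucene.Net.Analysis;

namespace Lucene.Net.Analysis.CnByKing
{
    // Illustrative sketch: drop a configurable list of Chinese stop words.
    // Wire it in with: result = new ChineseStopFilter(new ChineseFilter(result));
    public sealed class ChineseStopFilter : TokenFilter
    {
        private static readonly string[] CHINESE_STOP_WORDS = { "的", "了", "和", "是" }; // example entries only

        private Hashtable stopTable = new Hashtable();

        public ChineseStopFilter(TokenStream _in)
            : base(_in)
        {
            foreach (string word in CHINESE_STOP_WORDS)
                stopTable[word] = word;
        }

        public override Token Next()
        {
            for (Token token = input.Next(); token != null; token = input.Next())
            {
                if (stopTable[token.TermText()] == null)
                    return token; // keep everything that is not a stop word
            }
            return null;
        }
    }
}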

None of the above involves much technical sophistication, but the benefit is that plugging in a new Chinese segmenter, whatever algorithm it uses, takes only a few lines of code, and the segmentation stays completely independent of DotLucene/Lucene.Net itself. To use it, simply replace StandardAnalyzer with ChineseAnalyzer.
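For example, a minimal indexing-and-search round trip might look like the following. This sketch assumes the Lucene.Net 1.9-era API (string-path IndexWriter/IndexSearcher constructors, Hits, and QueryParser); the index path, field name and sample text are placeholders.

using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CnByKing;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class SearchDemo
{
    static void Main()
    {
        Analyzer analyzer = new ChineseAnalyzer(); // instead of new StandardAnalyzer()

        // Build a small index in the "index" directory (path is a placeholder).
        IndexWriter writer = new IndexWriter("index", analyzer, true);
        Document doc = new Document();
        doc.Add(new Field("content", "Lucene.Net 中文全文检索示例", Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Optimize();
        writer.Close();

        // Query with the same analyzer so search terms are segmented the same way as the indexed text.
        IndexSearcher searcher = new IndexSearcher("index");
        Query query = new QueryParser("content", analyzer).Parse("检索");
        Hits hits = searcher.Search(query);
        for (int i = 0; i < hits.Length(); i++)
            Console.WriteLine(hits.Doc(i).Get("content"));
        searcher.Close();
    }
}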

Click Here To Download is the compiled package: the Lucene.Net 1.9.1 assembly, Lucene.Net.Analysis.CnByKing.dll and ShootSeg.dll. Reference these three and you have everything needed for simple Chinese search.