Adding Your Own Chinese Word-Segmentation Analyzer to DotLucene/Lucene.Net

 
A very simple, though not especially optimized, approach: subclass Lucene.Net.Analysis.Analyzer, Lucene.Net.Analysis.Tokenizer and Lucene.Net.Analysis.TokenFilter. The implementation is modeled on Lucene.Net.Analysis.Cn, which segments Chinese text into single characters (unigram segmentation).
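The actual segmentation is delegated to the ShootSeg component. Before the Lucene classes, here is a minimal sketch of the contract the code below assumes from it: the dictionaries are loaded once, a separator character is configured, and SegmentText returns the segmented words joined by that separator. The sample text and the output shown are illustrative only; actual results depend on ShootSeg's dictionary.

using System;
using ShootSeg; // the segmentation component referenced below

class SegmentSketch
{
    static void Main()
    {
        // Assumed ShootSeg usage, inferred from the analyzer/tokenizer code that follows.
        Segment segment = new Segment();
        segment.InitWordDics();   // load the word dictionaries
        segment.Separator = "|";  // segmented words are joined with this character
        string segmented = segment.SegmentText("全文检索引擎");
        Console.WriteLine(segmented); // e.g. "全文|检索|引擎" (output depends on the dictionary)
    }
}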

The ChineseAnalyzer class derives from Lucene.Net.Analysis.Analyzer:

using System;
using System.IO;
using System.Text;
using System.Collections;
using ShootSeg; // namespace of the segmentation component, an open-source project from http://www.shootsoft.net (thanks to the author)
using Lucene.Net.Analysis;

namespace Lucene.Net.Analysis.CnByKing
{
    public class ChineseAnalyzer : Analyzer
    {
        private Segment segment = new Segment(); // the Chinese word-segmentation component

        public ChineseAnalyzer()
        {
            segment.InitWordDics();  // load the word dictionaries in the constructor
            segment.Separator = "|"; // separator inserted between segmented words
        }

        public override sealed TokenStream TokenStream(String fieldName, TextReader reader)
        {
            TokenStream result = new ChineseTokenizer(reader, segment); // pass the segmenter into the tokenizer
            result = new ChineseFilter(result);                         // filter the tokenizer's output
            return result;
        }
    }
}

The ChineseTokenizer class derives from Lucene.Net.Analysis.Tokenizer:

using System;
using System.IO;
using System.Text;
using System.Collections;
using System.Globalization;
using ShootSeg;
using Lucene.Net.Analysis;

namespace Lucene.Net.Analysis.CnByKing
{
    public sealed class ChineseTokenizer : Tokenizer
    {
        private Segment segment;
        private string[] Wordlist;  // the segmented words
        private string Allstr;      // the whole input stream read into one string
        private int offset = 0;     // search offset within Allstr
        private int start = 0;      // start position of the current word
        private int step = 0;       // index of the next word in Wordlist

        public ChineseTokenizer(TextReader _in, Segment segment)
        {
            input = _in;
            Allstr = input.ReadToEnd();  // read the whole stream into Allstr
            this.segment = segment;      // reuse the analyzer's segmenter instance
            Wordlist = segment.SegmentText(Allstr).Split('|'); // segment and split into the word array
        }

        private Token Flush(string str)
        {
            if (str.Length > 0)
            {
                // Return a Token holding the word plus its start and end offsets in the stream.
                return new Token(str, start, start + str.Length);
            }
            else
                return null;
        }

        public override Token Next() // return the next Token, or null when the stream is exhausted
        {
            Token token = null;
            if (step < Wordlist.Length) // note: must be <, not <=, or the last call overruns the array
            {
                start = Allstr.IndexOf(Wordlist[step], offset); // locate the word in Allstr
                offset = start + 1;                             // advance the search offset
                token = Flush(Wordlist[step]);                  // build the Token
                step = step + 1;                                // move on to the next word
            }
            return token;
        }
    }
}
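To sanity-check the offset bookkeeping in Next(), a small harness like the following can dump every token together with its start and end offsets. This is only a sketch: it assumes the ShootSeg dictionaries are available and uses the Lucene.Net 1.9-era Token methods TermText(), StartOffset() and EndOffset(); the sample text is arbitrary.

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CnByKing;
using ShootSeg;

class TokenizerDemo
{
    static void Main()
    {
        // Set up the segmenter exactly as ChineseAnalyzer does.
        Segment segment = new Segment();
        segment.InitWordDics();
        segment.Separator = "|";

        // Feed a sample string through the tokenizer and print each token with its offsets.
        Tokenizer tokenizer = new ChineseTokenizer(new StringReader("Lucene.Net 中文分词测试"), segment);
        for (Token token = tokenizer.Next(); token != null; token = tokenizer.Next())
        {
            Console.WriteLine("{0} [{1},{2}]", token.TermText(), token.StartOffset(), token.EndOffset());
        }
        tokenizer.Close();
    }
}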

The ChineseFilter class derives from Lucene.Net.Analysis.TokenFilter and is copied verbatim from the class of the same name in the Lucene.Net.Analysis.Cn project (it drops digits and symbols as well as common English stop words; to filter anything else, add the corresponding code, as in the sketch after the class below).
using System;
using System.IO;
using System.Collections;
using System.Globalization;
using Lucene.Net.Analysis;
namespace Lucene.Net.Analysis.CnByKing
{
/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2004 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

/// <summary>
/// Title: ChineseFilter
/// Description: Filter with a stop word table
/// Rule: No digital is allowed.
/// English word/token should larger than 1 character.
/// One Chinese character as one Chinese word.
/// TO DO:
/// 1. Add Chinese stop words, such as /ue400
/// 2. Dictionary based Chinese word extraction
/// 3. Intelligent Chinese word extraction
///
/// Copyright: Copyright (c) 2001
/// Company:
/// @author Yiyi Sun
/// @version $Id: ChineseFilter.java, v 1.4 2003/01/23 12:49:33 ehatcher Exp $
/// </summary>
    public sealed class ChineseFilter : TokenFilter
    {
        // Only English now, Chinese to be added later.
        public static String[] STOP_WORDS =
        {
            "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it",
            "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"
        };

        private Hashtable stopTable;

        public ChineseFilter(TokenStream _in)
            : base(_in)
        {
            stopTable = new Hashtable(STOP_WORDS.Length);

            for (int i = 0; i < STOP_WORDS.Length; i++)
                stopTable[STOP_WORDS[i]] = STOP_WORDS[i];
        }

        public override Token Next()
        {
            for (Token token = input.Next(); token != null; token = input.Next())
            {
                String text = token.TermText();

                // why not key off token type here assuming ChineseTokenizer comes first?
                if (stopTable[text] == null)
                {
                    switch (Char.GetUnicodeCategory(text[0]))
                    {
                        case UnicodeCategory.LowercaseLetter:
                        case UnicodeCategory.UppercaseLetter:
                            // English word/token should larger than 1 character.
                            if (text.Length > 1)
                            {
                                return token;
                            }
                            break;

                        case UnicodeCategory.OtherLetter:
                            // One Chinese character as one Chinese word.
                            // Chinese word extraction to be added later here.
                            return token;
                    }
                }
            }
            return null;
        }
    }
}
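As noted above, this filter only knows English stop words. One possible way to extend it, following the TODO in its header comment, is to chain an extra stop-word filter behind it inside ChineseAnalyzer.TokenStream. The following is an illustrative sketch; the class name and the Chinese stop words listed are examples only, not part of the original project.

using System;
using System.Collections;
using Lucene.Net.Analysis;

namespace Lucene.Net.Analysis.CnByKing
{
    // Illustrative sketch: drop a configurable list of Chinese stop words.
    // Wire it in with: result = new ChineseStopFilter(new ChineseFilter(result));
    public sealed class ChineseStopFilter : TokenFilter
    {
        private static readonly string[] CHINESE_STOP_WORDS = { "的", "了", "和", "是" }; // example entries only

        private Hashtable stopTable = new Hashtable();

        public ChineseStopFilter(TokenStream _in)
            : base(_in)
        {
            foreach (string word in CHINESE_STOP_WORDS)
                stopTable[word] = word;
        }

        public override Token Next()
        {
            for (Token token = input.Next(); token != null; token = input.Next())
            {
                if (stopTable[token.TermText()] == null)
                    return token; // keep everything that is not a stop word
            }
            return null;
        }
    }
}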

None of the above involves much technical sophistication, but the benefit is that plugging in a new Chinese segmenter, whatever algorithm it uses, takes only a few lines of code, and the segmentation stays completely independent of DotLucene/Lucene.Net itself. To use it, simply replace StandardAnalyzer with ChineseAnalyzer.
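For example, a minimal indexing-and-search round trip might look like the following. This sketch assumes the Lucene.Net 1.9-era API (string-path IndexWriter/IndexSearcher constructors, Hits, and QueryParser); the index path, field name and sample text are placeholders.

using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CnByKing;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class SearchDemo
{
    static void Main()
    {
        Analyzer analyzer = new ChineseAnalyzer(); // instead of new StandardAnalyzer()

        // Build a small index in the "index" directory (path is a placeholder).
        IndexWriter writer = new IndexWriter("index", analyzer, true);
        Document doc = new Document();
        doc.Add(new Field("content", "Lucene.Net 中文全文检索示例", Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Optimize();
        writer.Close();

        // Query with the same analyzer so search terms are segmented the same way as the indexed text.
        IndexSearcher searcher = new IndexSearcher("index");
        Query query = new QueryParser("content", analyzer).Parse("检索");
        Hits hits = searcher.Search(query);
        for (int i = 0; i < hits.Length(); i++)
            Console.WriteLine(hits.Doc(i).Get("content"));
        searcher.Close();
    }
}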

Click Here To Download is the compiled package: the Lucene.Net 1.9.1 assembly, Lucene.Net.Analysis.CnByKing.dll and ShootSeg.dll. Reference these three and you have everything needed for simple Chinese search.