[How to] Make custom search with Nutch (v 1.0)? (Repost)
http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/
What is Nutch?
Nutch is an open source web crawler + search engine based on Lucene. These are a few things that make it great:
- Open source
- Has a web crawler that understands and indexes HTML, RTF and PDF formats, and follows the links it encounters
- Its search engine is based on Lucene
- Has a plugin-based architecture, which means we can have our own plugins for indexing and searching without changing a single line of code in the core nutch.jar
- Uses Hadoop for storing indexes, so it's pretty scalable
Use case
Suppose we want to search for the author of the website by his email id.
First things first: let's index our custom data
Before we can search for our custom data, we need to index it. Nutch has a plugin architecture very similar to that of Eclipse. We can write our own plugin for indexing. Here is the source code:
    package com.swayam.nutch.plugins.indexfilter;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.indexer.lucene.LuceneWriter;
    import org.apache.nutch.parse.Parse;

    /**
     * Adds the site author's email to every indexed document.
     *
     * @author paawak
     */
    public class EmailIndexingFilter implements IndexingFilter {

        private static final Log LOG = LogFactory.getLog(EmailIndexingFilter.class);

        // name of the custom field in the index
        private static final String KEY_CREATOR_EMAIL = "email";

        private Configuration conf;

        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                CrawlDatum datum, Inlinks inlinks) throws IndexingException {

            // look up email of the author based on the url of the site
            String creatorEmail = EmailLookup.getCreatorEmail(url.toString());

            LOG.info("######## creatorEmail = " + creatorEmail);

            // only add the field when an email is known for this url
            if (creatorEmail != null) {
                doc.add(KEY_CREATOR_EMAIL, creatorEmail);
            }

            return doc;
        }

        // store the field in the index and tokenize it, so that it can be searched
        public void addIndexBackendOptions(Configuration conf) {
            LuceneWriter.addFieldOptions(KEY_CREATOR_EMAIL, LuceneWriter.STORE.YES,
                    LuceneWriter.INDEX.TOKENIZED, conf);
        }

        public Configuration getConf() {
            return conf;
        }

        public void setConf(Configuration conf) {
            this.conf = conf;
        }

    }
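The filter above calls EmailLookup.getCreatorEmail(), which is not shown in this post. The following is only a minimal sketch of what such a helper could look like, assuming the email is looked up by the host name of the URL in an in-memory map. The map contents and the host www.example.com are made up for illustration, and the package declaration is left out to keep the snippet self-contained; in the plugin the class would sit in the same package as the filter.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the EmailLookup helper used by the filter. */
final class EmailLookup {

    // Assumed sample data: host name -> author email.
    // A real lookup might read from a database or a config file instead.
    private static final Map<String, String> EMAIL_BY_HOST = new HashMap<String, String>();

    static {
        EMAIL_BY_HOST.put("www.example.com", "jsmith@mydomain.com");
    }

    private EmailLookup() {
    }

    /** Returns the author's email for the given page url, or null if none is known. */
    static String getCreatorEmail(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            return EMAIL_BY_HOST.get(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // malformed url: treat it as "no email known"
            return null;
        }
    }
}
```

Returning null for unknown or malformed urls fits the null check in the filter, which then simply skips the email field for that document.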
Also, you need to create a plugin.xml:
    <plugin id="index-email" name="Email Indexing Filter" version="1.0.0" provider-name="swayam">

        <runtime>
            <library name="EmailIndexingFilterPlugin.jar">
                <export name="*" />
            </library>
        </runtime>

        <requires>
            <import plugin="nutch-extensionpoints" />
        </requires>

        <extension id="com.swayam.nutch.plugins.indexfilter.EmailIndexingFilter"
                   name="Email Indexing Filter"
                   point="org.apache.nutch.indexer.IndexingFilter">
            <implementation id="index-email"
                            class="com.swayam.nutch.plugins.indexfilter.EmailIndexingFilter" />
        </extension>

    </plugin>
Once this is done, create a new folder under $NUTCH_HOME/plugins (for example, one named after the plugin id, index-email) and put the jar and the plugin.xml there.
Now we have to activate this plugin. To do this, we have to edit the conf/nutch-site.xml.
    <property>
        <name>plugin.includes</name>
        <value>nutch-extensionpoints|protocol-http|parse-(text|html)|index-(basic|email)|query-(basic|site|url)</value>
        <description>Regular expression naming plugin id names to
        include. Any plugin not matching this expression is excluded.
        In any case you need at least include the nutch-extensionpoints plugin. By
        default Nutch includes crawling just HTML and plain text via HTTP,
        and basic indexing and search plugins.
        </description>
    </property>
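Since plugin.includes is a regular expression over plugin ids, it is easy to check which ids a given value lets through. The sketch below does this with plain java.util.regex, assuming that the whole plugin id has to match the expression; the class name PluginIncludesCheck is made up for this example and is not part of Nutch.

```java
import java.util.regex.Pattern;

/** Checks plugin ids against the plugin.includes value shown above. */
final class PluginIncludesCheck {

    // Copied from the plugin.includes value in conf/nutch-site.xml above.
    static final String INCLUDES =
            "nutch-extensionpoints|protocol-http|parse-(text|html)"
            + "|index-(basic|email)|query-(basic|site|url)";

    private PluginIncludesCheck() {
    }

    /** Returns true if the whole plugin id matches the expression (an assumption). */
    static boolean isIncluded(String pluginId) {
        return Pattern.matches(INCLUDES, pluginId);
    }

    public static void main(String[] args) {
        String[] ids = {"index-basic", "index-email", "query-basic", "query-email"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

With this value, index-email is included but an id such as query-email is not, so any additional plugin needs its own id added to the expression, for example by extending the last group to query-(basic|site|url|email).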
Now, how do I search my indexed data?
Option 1 [cumbersome]:
Add my own query plugin:
    package com.swayam.nutch.plugins.queryfilter;

    import org.apache.nutch.searcher.FieldQueryFilter;

    /**
     * Makes the "email" field searchable through the email: prefix.
     *
     * @author paawak
     */
    public class MyEmailQueryFilter extends FieldQueryFilter {

        public MyEmailQueryFilter() {
            // name of the index field this filter handles
            super("email");
        }

    }

Do not forget to write a plugin.xml for this plugin as well:

    <plugin id="query-email" name="Email Query Filter" version="1.0.0" provider-name="swayam">

        <runtime>
            <library name="EmailQueryFilterPlugin.jar">
                <export name="*"/>
            </library>
        </runtime>

        <requires>
            <import plugin="nutch-extensionpoints"/>
        </requires>

        <extension id="com.swayam.nutch.plugins.queryfilter.MyEmailQueryFilter"
                   name="Email Query Filter"
                   point="org.apache.nutch.searcher.QueryFilter">
            <implementation id="query-email"
                            class="com.swayam.nutch.plugins.queryfilter.MyEmailQueryFilter">
                <parameter name="fields" value="email"/>
            </implementation>
        </extension>

    </plugin>
This line is particularly important:
    <parameter name="fields" value="email"/>

If you skip this line, searches on the email field will never return any results.
The only catch here is you have to prepend the keyword email: to the search key. For example, if you want to search for jsmith@mydomain.com, you have to search for email:jsmith@mydomain.com or email:jsmith.
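To illustrate why the prefix matters, here is a rough sketch of how a field-qualified clause such as email:jsmith can be separated into a field name and a term. This is not Nutch's own query parser; the class name FieldClauseExample and the "DEFAULT" placeholder used for unprefixed terms are assumptions made for this example.

```java
/** Illustration only: splits a search clause into a field name and a term. */
final class FieldClauseExample {

    // Placeholder field name for terms without a prefix (an assumption).
    static final String DEFAULT_FIELD = "DEFAULT";

    private FieldClauseExample() {
    }

    /** Returns {field, term}; a clause without a "field:" prefix gets DEFAULT_FIELD. */
    static String[] split(String clause) {
        int colon = clause.indexOf(':');
        if (colon <= 0) {
            return new String[] {DEFAULT_FIELD, clause};
        }
        return new String[] {clause.substring(0, colon), clause.substring(colon + 1)};
    }

    public static void main(String[] args) {
        String[] prefixed = split("email:jsmith@mydomain.com");
        String[] bare = split("jsmith@mydomain.com");
        System.out.println(prefixed[0] + " / " + prefixed[1]);
        System.out.println(bare[0] + " / " + bare[1]);
    }
}
```

Only the prefixed form names the email field explicitly, which is why the query filter from Option 1 needs the email: prefix to come into play.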
There is an easier and more elegant way :), read on…
Option 2 [smart]
Use the existing query-basic plugin.
This involves editing just one file: conf/nutch-default.xml.
In the default distribution, you can see some commented lines like this:
    <!--
    <property>
        <name>query.basic.description.boost</name>
        <value>1.0</value>
        <description> Declares a custom field and its boost to be added to the default fields of the Lucene query.
        </description>
    </property>
    -->

All you have to do is uncomment them and replace description with your custom field, email in our case. The resulting fragment will look like this:

    <property>
        <name>query.basic.email.boost</name>
        <value>1.0</value>
        <description> Queries the author of the site by his email-id
        </description>
    </property>

With this in place, when looking for jsmith@mydomain.com you can simply enter jsmith@mydomain.com or a part of the name, like jsmit.
Building a Nutch plugin
The preferred way is with Ant, but I have used Maven with the following dependencies:

    <project>
        ...
        <dependencies>
            ...
            <!-- nutch -->
            <dependency>
                <groupId>org.apache.lucene</groupId>
                <artifactId>lucene-core</artifactId>
                <version>2.4.0</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.lucene</groupId>
                <artifactId>lucene-misc</artifactId>
                <version>2.4.0</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.nutch</groupId>
                <artifactId>nutch</artifactId>
                <version>1.0</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.taglibs</groupId>
                <artifactId>taglibs-i18n</artifactId>
                <version>1.0.N20030822</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika</artifactId>
                <version>0.1-incubating</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>xerces</groupId>
                <artifactId>xerces</artifactId>
                <version>2.6.2</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>xerces</groupId>
                <artifactId>xerces-apis</artifactId>
                <version>2.6.2</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.jets3t.service</groupId>
                <artifactId>jets3t</artifactId>
                <version>0.6.1</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>oro</groupId>
                <artifactId>oro</artifactId>
                <version>2.0.8</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>com.ibm.icu</groupId>
                <artifactId>icu4j</artifactId>
                <version>4.0.1</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-core</artifactId>
                <version>0.19.1</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.solr</groupId>
                <artifactId>solr-common</artifactId>
                <version>1.3.0</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.solr</groupId>
                <artifactId>solrj</artifactId>
                <version>1.3.0</version>
                <scope>provided</scope>
            </dependency>
            <!-- end nutch -->
            ...
        </dependencies>
        ...
    </project>
Useful links
Be warned that these are a bit outdated, so they may not be correct verbatim.
- Introduction to Nutch, Part 1: Crawling
- Introduction to Nutch, Part 2: Searching
- Nutch wiki: WritingPluginExample-0.9
Reposted from: http://www.cnblogs.com/bnuvincent/archive/2009/12/20/1628066