|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000774||YaCy||Wishlist - Wunschliste||public||2017-12-27 16:32||2019-07-28 08:38|
|Target Version||Fixed in Version|
|Summary||0000774: [Feature request] Add crawler filter based on document language|
|Description||Currently the crawler has a raw localization filter based on the IP country code of the host server; an improvement would be to also offer a filter based on the actual language detected from the document content.|
_ Main pros:
it would be easier to set up a large crawler job aimed at one specific language and spanning a wide array of domain names and subdomains, without having to specify individual regex filter rules for each of those domains, since domain names only occasionally contain a country code. Also consider that the crawler may be allowed to extend to domain names or hosts not initially listed in the crawler job specification, which makes such rules even harder to specify.
The current filter based on country code may also be inadequate as a single host may offer documents in multiple languages.
_ Main cons:
a simple DNS lookup wouldn't be enough to execute the filter rules, as the document needs to be downloaded and parsed before it can be passed through the filter.
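To illustrate the trade-off, here is a minimal Python sketch (not YaCy code) contrasting the two filter stages: a country check that can run before the download, and a language check that can only run once the document text is available. The function names and the tiny stopword lists are hypothetical and only stand in for a real language detector.

```python
from typing import Optional, Set

# Hypothetical stopword lists, far too small for real language detection.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "en": {"the", "and", "is", "of", "to", "not"},
}


def guess_language(text: str) -> Optional[str]:
    """Toy detector: return the language whose stopwords occur most often."""
    words = text.lower().split()
    scores = {
        lang: sum(1 for word in words if word in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No stopword matched at all: the language is unknown.
    return best if scores[best] > 0 else None


def accept_before_fetch(host_country: str, allowed_countries: Set[str]) -> bool:
    """Cheap pre-download check, comparable to the existing country filter."""
    return host_country in allowed_countries


def accept_after_parse(text: str, allowed_languages: Set[str]) -> bool:
    """Content-based check: needs the downloaded and parsed document text."""
    return guess_language(text) in allowed_languages
```

The second function is the costly one: it cannot reject a URL until the crawler has already fetched and parsed the document.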
|Tags||No tags attached.|
edited on: 2018-06-22 09:06
Since this commit (https://github.com/yacy/yacy_search_server/commit/cced94298ab946125bc29e58583431ac4dd6a426) you can now set up a generic document filter for your crawl using Solr syntax. The new field is named "Solr query filter on any active indexed field(s)" in the /CrawlStartExpert.html page.
For example, to only add to your index crawled documents detected as German language, you can use this filter query: language_s:de
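If you need to generate such a filter for one or more languages, a small helper like the following Python sketch may be handy. The function name is hypothetical. The single-language output matches the example above; the multi-language form uses standard Solr boolean grouping, which is assumed (not verified here) to be accepted by the crawl start field.

```python
def build_language_filter(languages, field="language_s"):
    """Build a Solr filter query matching any of the given language codes."""
    codes = [code.strip().lower() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    for code in codes:
        # Reject anything that is not a plain alphabetic code, so no
        # Solr query syntax can slip into the generated filter.
        if not code.isalpha():
            raise ValueError(f"invalid language code: {code!r}")
    if len(codes) == 1:
        return f"{field}:{codes[0]}"
    return f"{field}:(" + " OR ".join(codes) + ")"


if __name__ == "__main__":
    print(build_language_filter(["de"]))
    print(build_language_filter(["de", "fr"]))
```

The resulting string would then be pasted into the "Solr query filter on any active indexed field(s)" field.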
|2017-12-27 16:32||Davide||New Issue|
|2018-06-22 09:06||luc||Note Added: 0001498|
|2018-06-22 09:06||luc||Note Edited: 0001498|