ID: 0000774
Project: YaCy
Category: Wishlist - Wunschliste
View Status: public
Date Submitted: 2017-12-27 16:32
Last Update: 2021-02-27 17:17
Summary: 0000774: [Feature request] Add crawler filter based on document language
Description: Currently the crawler has a crude localization filter based on the IP country code of the host server. An improvement would be to also offer a filter based on the actual language detected from the document content.

_ Main pros:
It would be easier to set up a large crawler job aimed at one specific language and spanning a wide array of domain names and subdomains, without having to specify individual regex filter rules for each of those domains, since domain names only occasionally contain a country code. The crawler may also be allowed to extend to domain names or hosts not initially listed in the crawler job specification, which makes such rules even harder to specify.

The current filter based on country code may also be inadequate as a single host may offer documents in multiple languages.

_ Main cons:
A simple DNS lookup would not be enough to evaluate the filter rules, as the document needs to be downloaded and parsed before it can be passed through the filter.
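
The ordering constraint described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not YaCy's crawler code: `crawl`, `fetch_and_parse` and `detect_language` are hypothetical names, and the stop-word counter is a toy stand-in for a real language detector.

```python
import re
from typing import Callable, Iterable, List, Pattern, Tuple

# Toy stop-word lists (assumption: a real crawler would use a proper detector).
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine"},
    "en": {"the", "and", "is", "not", "a", "of", "to", "in"},
}


def detect_language(text: str) -> str:
    """Return the language whose stop words occur most often, or 'unknown'."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def crawl(
    urls: Iterable[str],
    url_filter: Pattern[str],
    fetch_and_parse: Callable[[str], str],
    wanted_language: str,
) -> Tuple[List[str], int]:
    """Return the URLs accepted for indexing and the number of downloads made."""
    indexed: List[str] = []
    downloads = 0
    for url in urls:
        # Stage 1: URL-level rules are cheap and need no download.
        if not url_filter.search(url):
            continue
        # Stage 2: the language filter needs the parsed text, so the
        # document is downloaded even if it ends up being rejected.
        text = fetch_and_parse(url)
        downloads += 1
        if detect_language(text) == wanted_language:
            indexed.append(url)
    return indexed, downloads
```

With two pages on the same host, one German and one English, both are downloaded but only the German one is kept, which is the extra cost the language filter implies.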
Tags: No tags attached.

-  Notes
luc (reporter)
2018-06-22 09:06
edited on: 2018-06-22 09:06

Since this commit (https://github.com/yacy/yacy_search_server/commit/cced94298ab946125bc29e58583431ac4dd6a426), you can set up a generic document filter for your crawl using Solr syntax. The new field is named "Solr query filter on any active indexed field(s)" on the /CrawlStartExpert.html page.

For example, to add only crawled documents detected as German to your index, you can use this filter query: language_s:de
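
If several languages are wanted, the filter string could be generated rather than typed by hand. The helper below is a hypothetical sketch (`build_language_filter` is not part of YaCy). It assumes the field accepts the grouped `field:(a OR b)` form of the standard Solr/Lucene query syntax, which has not been verified against the crawl start page.

```python
import re
from typing import Iterable

# Assumption: language codes are plain two- or three-letter codes such as "de".
_CODE = re.compile(r"[a-z]{2,3}")


def build_language_filter(languages: Iterable[str], field: str = "language_s") -> str:
    """Build a Solr-style filter query matching any of the given language codes."""
    codes = sorted({code.strip().lower() for code in languages})
    if not codes:
        raise ValueError("at least one language code is required")
    for code in codes:
        # Reject anything that is not a bare code, so no query syntax slips in.
        if not _CODE.fullmatch(code):
            raise ValueError(f"not a valid language code: {code!r}")
    if len(codes) == 1:
        return f"{field}:{codes[0]}"
    return f"{field}:(" + " OR ".join(codes) + ")"
```

For a single code this reproduces the query given above, `language_s:de`; for `["fr", "de"]` it yields `language_s:(de OR fr)`.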

- Issue History
Date Modified      Username   Field                   Change
2017-12-27 16:32   Davide     New Issue
2018-06-22 09:06   luc        Note Added: 0001498
2018-06-22 09:06   luc        Note Edited: 0001498
2021-02-27 17:17   quee9899   Note Added: 0001531
