YaCy-Bugtracker - YaCy
View Issue Details
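ID: 0000774
Project: YaCy
Category: Wishlist - Wunschliste
View Status: public
Date Submitted: 2017-12-27 16:32
Last Update: 2018-06-22 09:06
Reporter: Davide
Priority: normal
Severity: minor
Reproducibility: N/A
Status: new
Resolution: open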
none 
 
 
0000774: [Feature request] Add crawler filter based on document language
Currently the crawler has a coarse localization filter based on the country code derived from the host server's IP address; an improvement would be to also offer a filter based on the actual language detected from the document content.

_ Main pros:
It would be easier to set up a large crawl job aimed at one specific language and spanning a wide array of domain names and subdomains, without having to specify individual regex filter rules for each of those domains, since domain names occasionally contain a country code. Also consider that the crawler may be allowed to extend to domain names or hosts not initially listed in the crawl job specification, which makes such rules even harder to specify.

The current filter based on country code may also be inadequate as a single host may offer documents in multiple languages.

_ Main cons:
A simple DNS lookup wouldn't be enough to execute the filter rules, as the document needs to be downloaded and parsed before it can be passed through the filter.
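
The difference between the two kinds of filter could be sketched roughly as below. This is only an illustration in Python, not YaCy code: `Document`, `pre_fetch_ok` and `post_parse_ok` are hypothetical names, and the document language is assumed to have been detected already by whatever parser is in use.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical parsed document; `language` is assumed to come from a detector."""
    url: str
    language: str  # ISO 639-1 code, e.g. "de"


def pre_fetch_ok(host_country: str, allowed_countries: set[str]) -> bool:
    """Cheap check: needs only the country code of the host, no download."""
    return host_country.upper() in allowed_countries


def post_parse_ok(doc: Document, allowed_languages: set[str]) -> bool:
    """Costly check: only possible once the document was fetched and parsed."""
    return doc.language.lower() in allowed_languages


# A German-language page served from a host located in the US:
doc = Document(url="https://example.org/de/index.html", language="de")
rejected_by_country = not pre_fetch_ok("US", {"DE"})
accepted_by_language = post_parse_ok(doc, {"de"})
```

In this toy case the country filter would discard the page before it is ever fetched, while the language filter would keep it, at the price of one download and one parse.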
No tags attached.
Issue History
2017-12-27 16:32  Davide  New Issue
2018-06-22 09:06  luc  Note Added: 0001498
2018-06-22 09:06  luc  Note Edited: 0001498

Notes
(0001498)
luc   
2018-06-22 09:06   
Since this commit (https://github.com/yacy/yacy_search_server/commit/cced94298ab946125bc29e58583431ac4dd6a426) you can set up a generic document filter for your crawl using Solr syntax. The new field is named "Solr query filter on any active indexed field(s)" in the /CrawlStartExpert.html page.

For example, to add to your index only those crawled documents detected as German, you can use this filter query: language_s:de
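
If several languages should be accepted, the same kind of filter query can be composed with standard Solr grouping. The small Python helper below is just a sketch: `language_filter_query` is a hypothetical name, only the `language_s` field comes from the example above, and whether the crawl start form accepts a grouped expression exactly like this has not been verified here.

```python
def language_filter_query(languages):
    """Build a Solr filter query on language_s for one or more language codes."""
    codes = [code.strip().lower() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    for code in codes:
        # Accept plain alphabetic codes only, so no query syntax can slip in.
        if not code.isalpha():
            raise ValueError(f"unexpected language code: {code!r}")
    if len(codes) == 1:
        return f"language_s:{codes[0]}"
    return "language_s:(" + " OR ".join(codes) + ")"


print(language_filter_query(["de"]))
print(language_filter_query(["de", "fr"]))
```

The resulting string would be pasted into the "Solr query filter on any active indexed field(s)" field.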