YaCy-Bugtracker

View Issue Details
ID: 0000606
Project: YaCy
Category: [All Projects] General
View Status: public
Date Submitted: 2015-10-10 18:32
Last Update: 2015-10-13 02:46
Reporter: Davide
Assigned To: BuBu
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
ETA: none
Platform:
OS:
OS Version:
Product Version: YaCy 1.8
Target Version:
Fixed in Version:
Summary: 0000606: Recorded crawler mangled
Description: A registered crawler in Table_API_p.html fails to execute.

Clicking its "clone" icon redirects the browser to CrawlStartExpert.html, where I notice that the "Use filter" input field is pre-filled with a URL-encoded string. Once manually URL-decoded, the string becomes my correct, valid regex, and the crawler task can be started successfully again.
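
The manual decoding step can be sketched roughly as follows. This is only an illustration in Python, not YaCy code: the helper name `decode_filter` and the sample pre-filled value are hypothetical, and it assumes the field was form-encoded (so `+` stands for a space). Python's `re` is used as a stand-in sanity check; YaCy itself evaluates filters with Java regex, which may differ in details.

```python
import re
from urllib.parse import unquote_plus


def decode_filter(prefilled: str) -> str:
    """URL-decode a pre-filled filter value and check that it compiles as a regex."""
    decoded = unquote_plus(prefilled)
    re.compile(decoded)  # raises re.error if the decoded text is not a valid pattern
    return decoded


# Hypothetical pre-filled value, not copied from the report.
prefilled = "%28.*blog.*%29%7C%28.*forum.*%29"
print(decode_filter(prefilled))
```

Note that `unquote_plus` turns every `+` into a space, so it should only be applied to a value that really is URL-encoded; a plain regex using `+` as a quantifier would be damaged by it.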
Steps To Reproduce:
1) Create a crawler with a regex filter;
2) Re-execute the crawler from "Process Scheduler".
Tags: No tags attached.
Attached Files: url-encoded.pdf (102,930 bytes) 2015-10-12 11:01

- Relationships

-  Notes
(0001115)
BuBu (developer)
2015-10-11 22:50

Could not reproduce it.
I used a regex containing characters that would be URL-encoded in all 4 filter fields:

- Load Filter on URLs - Use filter
- Load Filter on IPs - must-match
- Filter on URLs - must-match
- Filter on Content of Document - must-match

All of them were fine after pressing the clone button.

Could you give an example of the regex you used and say in which of the above filters you encountered the problem?
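
When picking a regex for a reproduction attempt, it can help to see which of its characters would actually change under URL encoding. A small sketch, assuming standard percent-encoding as implemented by Python's `urllib.parse.quote`; the helper name `chars_needing_encoding` and the sample pattern are hypothetical, and the encoder used inside YaCy may escape a slightly different set.

```python
from urllib.parse import quote


def chars_needing_encoding(pattern: str) -> list[str]:
    """Return the distinct characters of `pattern` that percent-encoding would alter."""
    return sorted({ch for ch in pattern if quote(ch, safe="") != ch})


# Hypothetical test pattern for a reproduction attempt.
print(chars_needing_encoding("(.*blog.*)|(.*forum.*)"))
```

A pattern for which this list is empty would look identical before and after encoding, so it could not reveal the problem.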
(0001117)
Davide (reporter)
2015-10-12 11:13

The uploaded page url-encoded.pdf shows what I get just after pressing the "clone" button in Table_API_p.html. Notice that the field with HTML ID "intention" ("Index Attributes" → "Do Remote Indexing") is URL-encoded, too.

Here are the original, unencoded values I entered for each field, listed by their input ID:


crawlingURL:
http://www.amazon.com/
http://www.futureshop.ca/
http://www.newegg.com/
http://www.tigerdirect.com/
http://www.bestbuy.com/site/electronics/computers-pcs/abcat0500000.c?id=abcat0500000

mustmatch:
(\.(jpg|jpeg|gif|giff|png|tif|tiff)$)|(.*\bamazon.com(/.*)?)|(.*\bbestbuy.com(/.*)?)|(.*\bfutureshop.ca(/.*)?)|(.*\bnewegg.com(/.*)?)|(.*\btigerdirect.com(/.*)?)(.*\bhighspeedbackbone.net(/.*)?)|(.*\bbbystatic.com(/.*)?)

mustnotmatch:
(.*spanish.bestbuy.com.*)|(.*blog.*)|(.*forum.*)

intention:
Hardware product pages in YaCy index
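
To illustrate why an encoded filter stops working, here is a rough Python sketch using the mustnotmatch and intention values listed above. It assumes the mangling is equivalent to form encoding via `urllib.parse.quote_plus`; the exact escaping applied by YaCy may differ, the test URL is hypothetical, and Python's `re` only approximates the Java regex engine.

```python
import re
from urllib.parse import quote_plus, unquote_plus

# Values as entered in the crawl start form (see the list above).
fields = {
    "mustnotmatch": r"(.*spanish.bestbuy.com.*)|(.*blog.*)|(.*forum.*)",
    "intention": "Hardware product pages in YaCy index",
}

for name, value in fields.items():
    encoded = quote_plus(value)            # assumed stand-in for the mangling
    assert unquote_plus(encoded) == value  # decoding restores the original text
    print(name, "->", encoded)

# The encoded text still compiles, but as one literal sequence instead of
# three alternatives, so it no longer matches the URLs it was written for.
intended = re.compile(fields["mustnotmatch"])
mangled = re.compile(quote_plus(fields["mustnotmatch"]))

url = "http://blog.example.com/post"  # hypothetical URL
print(bool(intended.fullmatch(url)), bool(mangled.fullmatch(url)))
```

Under these assumptions the encoded pattern is accepted as a regex but matches almost nothing, which would be consistent with a cloned crawl that starts with the wrong filter rather than one rejected outright.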
(0001118)
BuBu (developer)
2015-10-13 02:46

that helped...
fixed in v1.83/9403

- Issue History
Date Modified Username Field Change
2015-10-10 18:32 Davide New Issue
2015-10-11 22:50 BuBu Note Added: 0001115
2015-10-12 11:01 Davide File Added: url-encoded.pdf
2015-10-12 11:13 Davide Note Added: 0001117
2015-10-13 02:46 BuBu Note Added: 0001118
2015-10-13 02:46 BuBu Status new => resolved
2015-10-13 02:46 BuBu Resolution open => fixed
2015-10-13 02:46 BuBu Assigned To => BuBu

