|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000646||YaCy||[All Projects] General||public||2016-03-22 11:18||2016-09-01 15:54|
|Product Version||YaCy 1.8|
|Target Version||Fixed in Version|
|Summary||0000646: Refactoring of some postprocessing procedures|
|Description||It would be great to improve some of YaCy's postprocessing procedures, since they currently take a very long time to complete.|
|Additional Information||Postprocessing Progress |
busy:postprocessed 8200 from 100556956 collection documents; 8 ppm; 12189368 minutes remaining
|Tags||No tags attached.|
|Attached Files||postProcessProfiling.png (113,996 bytes) 2016-08-26 13:55|
The system load is very high during the process mentioned above:
load average: 10.54, 10.92, 10.72
These values represent a load of roughly 130-140% for my 8-core system (load average divided by the number of cores).
edited on: 2016-08-26 14:05
Hi, I attached a first profiling trace performed with VisualVM on a Debian Jessie machine, postprocessing 11240 recently crawled documents, with the webgraph Solr core disabled.
With this config, the two main hotspots consuming processor time are clearly identified and, unsurprisingly, are related to Solr operations:
- in CollectionConfiguration.postprocessing_doublecontent: the internal parsing of the Solr query used to search for double documents (ExtendedDismaxQParser.parse, 41%)
- in CollectionConfiguration.postprocessing: the partial update of each document (ConcurrentUpdateSolrConnector.update(), 34.1%)
|Thank you very much for the detailed analysis. I am using Debian Jessie here, too. It runs much more smoothly since upgrading to YaCy 1.9 and Java 8 with the additional Java args -XX:+UseParallelGC -XX:+UseNUMA in a multiprocessor environment. I'm sure the new Solr version is improved, too.|
I continued experimenting a little bit to see how much this can be improved.
Just to check, I disabled the search for double documents and changed how updates are performed: I grouped them into batches of 100, and even 1000, documents to avoid committing at each update.
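The grouping described above can be sketched roughly as follows. This is only an illustration and not the code used in the experiment: `BatchedUpdater` and its `sink` callback are hypothetical names, and the callback merely stands in for "send one batch of partial updates to Solr, then commit once".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers updates and hands them over in fixed-size batches. */
public class BatchedUpdater<T> {
    private final int batchSize;
    // Assumption: stands in for "push the batch to the index and commit once".
    private final Consumer<List<T>> sink;
    private final List<T> buffer;
    private int flushCount = 0;

    public BatchedUpdater(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<>(batchSize);
    }

    /** Queues one update and flushes as soon as the batch is full. */
    public void add(T update) {
        buffer.add(update);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Hands the pending updates to the sink; does nothing if the buffer is empty. */
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // pass a copy so the sink may keep it
        buffer.clear();
        flushCount++;
    }

    public int getFlushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        int[] received = {0};
        BatchedUpdater<Integer> updater =
                new BatchedUpdater<>(1000, batch -> received[0] += batch.size());
        for (int i = 0; i < 11240; i++) {
            updater.add(i);
        }
        updater.flush(); // do not forget the last, partially filled batch
        System.out.println(updater.getFlushCount() + " batches, " + received[0] + " updates");
    }
}
```

With 11240 documents, a batch size of 1000 means 12 hand-overs instead of 11240, and a batch size of 100 means 113. Whether this helps depends on where the time is actually spent.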
With my small number of test documents (11240), this roughly halves the processing time. But it is no miracle: with this logic and millions of documents to process, the time would still be far too long.
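To put a number on "far too long", here is a small back-of-the-envelope calculation. It uses the figures from the progress line in the issue (8200 of 100556956 documents, 8 ppm) and assumes, purely for illustration, that the halving observed on the small test set would carry over as a doubled rate on that larger index. `PostprocessingEta` is a hypothetical class name.

```java
import java.util.Locale;

/** Hypothetical helper: rough estimate of the remaining postprocessing time. */
public class PostprocessingEta {

    /** Minutes needed to process the remaining documents at the given rate (documents per minute). */
    public static double minutesRemaining(long total, long done, double docsPerMinute) {
        return (total - done) / docsPerMinute;
    }

    /** Converts minutes to years, using 365.25 days per year. */
    public static double minutesToYears(double minutes) {
        return minutes / (60.0 * 24.0 * 365.25);
    }

    public static void main(String[] args) {
        long total = 100_556_956L;  // collection documents, from the issue's progress line
        long done = 8_200L;         // already postprocessed, from the same line
        double reportedRate = 8.0;  // ppm as displayed in the issue (a rounded value)
        double doubledRate = 2 * reportedRate; // assumption: processing time is halved

        for (double rate : new double[] {reportedRate, doubledRate}) {
            double years = minutesToYears(minutesRemaining(total, done, rate));
            System.out.println(String.format(Locale.ROOT, "at %.0f ppm: about %.1f years", rate, years));
        }
    }
}
```

Under these assumptions the estimate drops from roughly 24 years to roughly 12 years, which is still impractical. The 12189368 minutes shown in the issue correspond to about 23 years; the small difference comes from the displayed rate being rounded.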
I wonder whether this processing could be performed at another, more appropriate place or time...
Unfortunately, more extensive profiling revealed that the idea of grouping updates per commit is ineffective. The real main CPU-burning hotspot is the postprocessing_doublecontent query.
With my current knowledge, I do not know how to drastically improve this...
By the way, for now I have committed some refactorings that at least make the whole postprocessing algorithm easier to understand and to profile, for example with VisualVM.
See Pull Request: https://github.com/yacy/yacy_search_server/pull/71
|2016-03-22 11:18||LA_FORGE||New Issue|
|2016-03-22 11:29||LA_FORGE||Note Added: 0001230|
|2016-08-26 13:55||luc||File Added: postProcessProfiling.png|
|2016-08-26 14:03||luc||Note Added: 0001287|
|2016-08-26 14:05||luc||Note Edited: 0001287|
|2016-08-28 20:06||LA_FORGE||Note Added: 0001288|
|2016-08-30 09:44||luc||Note Added: 0001291|
|2016-09-01 15:54||luc||Note Added: 0001292|