Re: [Bug-wget] Hello again
Hello Darshit Shah,

Converting a CMS to static HTML pages is not a solution that suits everyone. Sites that want to stay 'dynamic' and retain "backward flik-flak" abilities might not use wget2 and will keep their CMS or software behavior.

Many people who create a website use a CMS to generate it because it keeps the site uniform and applies every change made in the GUI site-wide. Those people might want a static website because it is faster to download (a Google SEO factor) and much more secure: it hides the CMS location and prevents login attempts.

If those people want to retain features such as RSS feeds, we might be able to tell them how to keep them.

If a website contains hidden pages that are linked only through JavaScript code, the programmer can create a shell script that calls wget2 and specifies the location of each hidden page.

Have a good weekend!

Michael

-----Original Message-----
From: 'Darshit Shah'
Sent: Thursday, 11 October, 2018 12:35 PM
To: mich...@cyber-dome.com
Cc: bug-wget@gnu.org
Subject: Re: [Bug-wget] Hello again
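The per-page script suggested in this message could look roughly like the sketch below. It is written in Python rather than shell so that the command lines can be inspected before anything is run. The URLs, the output directory and the helper name `build_commands` are made up for illustration; the options are the long forms documented for Wget, so check them against `wget2 --help` on your installation.

```python
import shlex

# Hypothetical pages that are only reachable through JavaScript, so a
# recursive crawl starting at the home page would not discover them.
HIDDEN_PAGES = [
    "https://example.com/offers/summer.html",
    "https://example.com/archive/2017/index.html",
]


def build_commands(urls, output_dir="static-site"):
    """Return one wget2 command (as an argument list) per hidden page."""
    commands = []
    for url in urls:
        commands.append([
            "wget2",
            "--page-requisites",  # also fetch the images/CSS the page needs
            "--convert-links",    # rewrite links so they work in the local copy
            "--directory-prefix", output_dir,
            url,
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands(HIDDEN_PAGES):
        # Only print the command; hand cmd to subprocess.run(cmd, check=True)
        # to actually execute it.
        print(shlex.join(cmd))
```

Keeping the list of hidden pages in one place means the site owner only has to maintain that list when new JavaScript-only pages are added.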
Re: [Bug-wget] Hello again
* mich...@cyber-dome.com [181009 17:12]:
>
> Hello Darshit Shah,
>
> Thank you for your welcome message. I am glad to be part of your project!
>
> I don't understand the term "javascript engine". AFAIK JavaScript is code that
> runs on the browser side, and we have no problem fetching it.
>

Exactly! JavaScript is code that is executed on the client side and hence requires a JavaScript engine, which interprets the code and executes it. However, Wget does not and will not package a JavaScript engine in order to run those scripts. This means that sites where JavaScript is used to create hyperlinks won't work well when scraped through Wget.

> There might be "ajax" issues with sites that rely on it. Ajax is used heavily by
> programmers and they will have to take some action on their site to
> incorporate the engine.

Similarly, sites that use JavaScript to show menus or create AJAX requests are usually not amenable to being scraped as a static HTML page.

> POST requests to comments and mail will need to be taken care of so they will
> work on a static site. One solution is to use a hosted supplier that will carry
> out the task and deliver spam removal as well.
> I think I will be able to write a howto document on that.
>
> Michael

--
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
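To make the point about JavaScript-created hyperlinks concrete, here is a small sketch. It is not Wget's parser; it uses Python's standard `html.parser` as a stand-in for any tool that reads the HTML source without executing scripts. The page content and the names `LinkCollector` and `static_links` are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the HTML source."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html):
    """Return the links a non-JavaScript crawler can see in the source."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <div id="menu"></div>
  <script>
    // A browser would run this and add a second link to the page.
    var a = document.createElement("a");
    a.href = "/hidden.html";
    document.getElementById("menu").appendChild(a);
  </script>
</body></html>
"""

if __name__ == "__main__":
    print(static_links(PAGE))  # prints ['/about.html']
```

The link to /hidden.html exists only after a browser has executed the script, so a crawler that never runs the script has no way to follow it.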
Re: [Bug-wget] Hello again
Hello Darshit Shah,

Thank you for your welcome message. I am glad to be part of your project!

I don't understand the term "javascript engine". AFAIK JavaScript is code that runs on the browser side, and we have no problem fetching it.

There might be "ajax" issues with sites that rely on it. Ajax is used heavily by programmers and they will have to take some action on their site to incorporate the engine.

POST requests to comments and mail will need to be taken care of so they will work on a static site. One solution is to use a hosted supplier that will carry out the task and deliver spam removal as well.
I think I will be able to write a howto document on that.

Michael

-----Original Message-----
From: Darshit Shah
Sent: Tuesday, 9 October, 2018 2:52 PM
To: mich...@cyber-dome.com
Cc: bug-wget@gnu.org
Subject: Re: [Bug-wget] Hello again
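One way the POST handling for comment and mail forms might be sketched: after the static copy has been made, rewrite each form's `action` so that it posts to the hosted supplier instead of the CMS. The endpoint URL below is hypothetical, `retarget_forms` is an illustrative helper name, and the regular expression only handles double-quoted `action` attributes; a real howto would probably want a proper HTML parser.

```python
import re

# Hypothetical endpoint of a hosted form-handling service; not a real URL.
HOSTED_ENDPOINT = "https://forms.example.net/submit/my-site"

# Captures everything up to and including action=" in a <form> tag, then the
# closing quote, so that only the attribute value in between is replaced.
FORM_ACTION = re.compile(r'(<form\b[^>]*?\saction=")[^"]*(")', re.IGNORECASE)


def retarget_forms(html, endpoint=HOSTED_ENDPOINT):
    """Point every <form action="..."> in the page at the hosted endpoint."""
    return FORM_ACTION.sub(lambda m: m.group(1) + endpoint + m.group(2), html)


if __name__ == "__main__":
    page = '<form method="post" action="/wp-comments-post.php"><input name="c"></form>'
    print(retarget_forms(page))
```

Pages without forms pass through unchanged, so the function can be applied to every HTML file in the static copy.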
Re: [Bug-wget] Hello again
Thank you!

-----Original Message-----
From: Tim Rühsen
Sent: Tuesday, 9 October, 2018 10:55 AM
To: mich...@cyber-dome.com; bug-wget@gnu.org
Subject: Re: [Bug-wget] Hello again
Re: [Bug-wget] Hello again
Hi Michael,

Nice to hear from you again. I vaguely remember a mention of someone who wanted to work on this feature. When deciding to make this work, please remember that any of this can only work if the site does not rely on JavaScript, which, given WordPress, is a difficult thing. The reason for this is that we do _not_ intend to ship a JavaScript engine along with Wget2. It is too large, unwieldy and too much of a maintenance nightmare. However, if the site can work without JavaScript, then I would assume that Wget2 can already handle making a static copy. If it can't handle something, please let us know / file a bug report about it.

Of course, I welcome you to work on Wget2 as you see fit. And we would love to look at any contributions you can make. We will also try and help you out as much as possible when dealing with the codebase.

About the dev setup, I only use vim and gdb to work with Wget. As Tim has already mentioned, he uses Netbeans and might be able to help you out.

You also mentioned something about the lib/ directory. That is an auto-generated dir with compatibility libs that you don't need to care about. All the code for Wget2 is in src/ and the code for the library is in libwget/. Those are the two main directories you need to care about. And of course tests/ for the tests.

* mich...@cyber-dome.com [181008 21:22]:
>
> Hello again,
>
> My name is Michael. I approached you about a year ago.
>
> I am interested in making wget2 a tool that can convert the output of content
> management systems (like WordPress) to HTML. This limits the content
> management system to generating the website every time it is changed, and the
> presentation is done using the HTTP server only.
>
> This is an important feature as it prevents security risks: hackers
> penetrating the site and installing viruses or stealing data.
> It also allows the website to be delivered much faster as no PHP code needs
> to run in order to deliver the content. Google already announced that site
> download speed is a factor in its SEO evaluation.
>
> I will be able to work for 3 hours every week on the project. I do need some
> guidance from you.
>
> I have started to configure Netbeans IDE as using a debugger can help me
> delve into the code much faster. There are some issues with the Netbeans. Do
> you use Id? Which one?
>
> Best regards,
>
> Michael

--
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
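For a site that works without JavaScript, making the static copy could be as simple as one mirroring run. The sketch below only builds the command line so it can be reviewed first; the site URL, the output directory and the helper name `mirror_command` are assumptions, and the options are the long forms documented for Wget, which should be verified with `wget2 --help` before relying on them.

```python
import shlex


def mirror_command(site_url, output_dir="static-copy"):
    """Build a wget2 command line that mirrors a site into output_dir."""
    return [
        "wget2",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # fetch the CSS, images and scripts pages refer to
        "--convert-links",     # make links point at the local copies
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--directory-prefix", output_dir,
        site_url,
    ]


if __name__ == "__main__":
    # Hypothetical site; run the printed command by hand, or pass the list
    # to subprocess.run(..., check=True).
    print(shlex.join(mirror_command("https://example.com/")))
```

If a run like this leaves parts of the site broken, that is the kind of case worth reporting as a bug, as suggested above.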
Re: [Bug-wget] Hello again
On 10/8/18 10:27 PM, mich...@cyber-dome.com wrote:
> The issue that I have is this:
>
> Since the source code is split across various directories (src, lib), Netbeans
> loses track of the source code in the lib directory.
> I verified it using gdb. (You can see how deep I went).

lib/ is an automatically created directory (gnulib stuff, created by 'bootstrap') and normally you are not interested in its contents. You might have the same issue with the test directories and fuzz/. I normally right-click on the file I am interested in and enable 'Code Assistance'.

> So, can you send me your Netbeans project settings?

Not the private/ stuff, but here is nbproject/configurations.xml and nbproject/project.xml.

Regards, Tim
Re: [Bug-wget] Hello again
The issue that I have is this:

Since the source code is split across various directories (src, lib), Netbeans loses track of the source code in the lib directory.
I verified it using gdb. (You can see how deep I went).

So, can you send me your Netbeans project settings?

Thank you,
Michael

-----Original Message-----
From: Tim Rühsen
Sent: Monday, 8 October, 2018 10:55 PM
To: mich...@cyber-dome.com; bug-wget@gnu.org
Subject: Re: [Bug-wget] Hello again
Re: [Bug-wget] Hello again
On 10/8/18 7:57 PM, mich...@cyber-dome.com wrote:
> I have started to configure Netbeans IDE as using a debugger can help me
> delve into the code much faster. There are some issues with the Netbeans. Do
> you use Id? Which one?

Id? It?

I use stock Netbeans 8.2 from https://netbeans.org/downloads/ (the All option). But you can take any 'version' and install the C/C++ plugin afterwards.

These are my jdk packages installed:

default-jdk                     2:1.10-68
default-jdk-headless            2:1.10-68
openjdk-10-jdk:amd64            10.0.2+13-1
openjdk-10-jdk-headless:amd64   10.0.2+13-1
openjdk-10-jre:amd64            10.0.2+13-1
openjdk-10-jre-headless:amd64   10.0.2+13-1
openjdk-7-jre-lib               7u95-2.6.4-1
openjdk-8-demo                  8u181-b13-1
openjdk-8-doc                   8u181-b13-1
openjdk-8-jdk:amd64             8u181-b13-1
openjdk-8-jdk-headless:amd64    8u181-b13-1
openjdk-8-jre:amd64             8u181-b13-1
openjdk-8-jre-headless:amd64    8u181-b13-1
openjdk-8-source                8u181-b13-1

What issues do you have?

Regards, Tim
[Bug-wget] Hello again
Hello again,

My name is Michael. I approached you about a year ago.

I am interested in making wget2 a tool that can convert the output of content management systems (like WordPress) to HTML. This limits the content management system to generating the website every time it is changed, and the presentation is done using the HTTP server only.

This is an important feature as it prevents security risks: hackers penetrating the site and installing viruses or stealing data.
It also allows the website to be delivered much faster as no PHP code needs to run in order to deliver the content. Google already announced that site download speed is a factor in its SEO evaluation.

I will be able to work for 3 hours every week on the project. I do need some guidance from you.

I have started to configure Netbeans IDE as using a debugger can help me delve into the code much faster. There are some issues with the Netbeans. Do you use Id? Which one?

Best regards,

Michael