jGetFile


Usage

usage: java -jar jgetfile.jar [-i addresses] [-d ] [-h] [-r ] [-md ] [-e addresses] [-mc
       ] [-de] [-s relative path] [-iU regular expression] [-re regular
       expression] [-ext ] [-df bytes] [-ct ] [-it ] [-dir directory path] [-als
       BeanShell script] [-ffn]
 -r,--root-address                              Root web address to begin
                                                traversing from
 -md,--max-downloads-per-conn                   Specifies the maximum
                                                number of concurrent downloads per connection, defaults to 2.
 -dir,--download-dir <directory path>           Directory to download all
                                                files to.
 -e,--exclude-domains <addresses>               Excludes links that start
                                                with the specified values, and traverses all others.
 -i,--include-domains <addresses>               Includes and traverses
                                                ONLY links that start with the specified values.
 -ext,--extensions                              File extensions to
                                                download, ex. -ext .pdf,.doc,.txt
 -re,--regexp-filter <regular expression>       Specify a regular
                                                expression to filter incoming links.
 -s,--start-from <relative path>                Specifies the relative
                                                path to start traversing from. Defaults to /
 -mc,--max-conn-to-hosts                        Specifies the maximum
                                                number of connections to open to unique hosts, defaults to 1.
 -ffn,--flat-file-namer                         Names downloaded files by
                                                the scheme: domain_path_file
 -als,--accept-link-script <BeanShell script>   The user created/specified
                                                BeanShell script determining whether an internal link is traversed or not.
 -df,--delete-files-less-than <bytes>           Deletes all files less
                                                than the given number of bytes.
 -ct,--crawler-threads                          Number of crawler threads
                                                to consume new links, defaults to 5.
 -de,--dynamically-exclude                      Prevents the Crawler from
                                                re-traversing non-HTML links. Ex.: the extension to search for is
                                                .pdf, a .doc link is encountered, and .doc goes on the list of
                                                excludes for traversal.
 -iU,--inner-url <regular expression>           Specify a regular
                                                expression to parse out inner urls.  Defaults to ".*(url=|u=)" which in
                                                jGetFile will match text after the last url= or u= parameter in the main
                                                url.
 -d,--depth                                     Depth to traverse down
                                                from root address
 -h,--help                                      Display help and exit.
 -it,--iterations                               Max links to traverse.

-r

-r specifies the root address to begin traversing from. The -r option is required and must be a valid address. Although jGetFile is optimized for HTTP traversal, it can additionally traverse FTP sites, or download FTP links found on HTTP sites. FTP authentication is not yet implemented, so FTP logins are made anonymously.
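
For example, a minimal invocation might look like the following (the host is illustrative; -ext, described below, must also be given):

java -jar jgetfile.jar -r http://www.getfiles.com -ext .zip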

-md

-md specifies the maximum number of concurrent downloads to attempt per connection. For example, if 100 files are scheduled for download from http://www.getfiles.com, by default only 2 will be downloaded at any one time, which can seem like poor performance, depending on needs. Therefore, if the bandwidth is available, this number should be increased.
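
For instance, to allow 5 simultaneous downloads per connection (illustrative command):

java -jar jgetfile.jar -r http://www.getfiles.com -ext .zip -md 5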

-dir

-dir specifies the local directory to transfer files to. If no directory is indicated, the current directory will be used.
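
Ex. (the path is illustrative):

-dir /home/user/downloads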

-e

-e indicates to jGetFile that the specified addresses are not to be traversed. For example, if one specifies

-e http://www.a.com,http://www.b.com
then jGetFile will not traverse any links that start with those addresses. Furthermore, the specified addresses do not have to be fully qualified; the filter simply checks whether the link in question starts with one of the specified -e addresses. So, if the link in question is http://getfiles.com and we passed in the argument
-e http://g
then jGetFile would not traverse this link.

-i

-i indicates to jGetFile that only links beginning with the specified addresses are to be traversed. For example, if one specifies

-i http://www.getfiles.com/a,http://www.getfiles.com/b
then only addresses beginning with those values will be traversed. If, for instance, a link with address http://www.getfiles.com/c is encountered, it will be skipped over; however, if a link with address http://www.getfiles.com/bat is encountered, it will be accepted, since it begins with http://www.getfiles.com/b.

-ext

-ext specifies which file extensions to download. At least one extension must be specified. Ex.

-ext .zip
or multiple extensions,
-ext .txt,.doc,.pdf,.avi,.zip

-s

-s specifies the relative address, based off the root address -r, from which jGetFile starts traversing. Ex. if our root address is http://getfiles.com and we want to begin traversing links from /softwarefiles/windowsapps, then we would do the following

-s /softwarefiles/windowsapps
Note that unless link filtering is specified, if the first link encountered by jGetFile points back to http://www.getfiles.com, then using -s serves no purpose. Each mass downloading case is different, as each web site is implemented differently. For the best results, it is often necessary to determine the web site's conventions and set up filters accordingly.

-mc

-mc specifies the maximum number of concurrent connections to open to unique hosts when downloading from them. Do not confuse this with the maximum number of crawler threads (-ct). If -mc is not specified, it defaults to 1, which means that at any one time only 1 connection will be made for downloading files. So if there are 1000 files queued for download from 500 unique hosts, only one unique host will be processed at a time, downloading an average of 2 files (the -md default). Using the default of 1 is typically not recommended, unless using dial-up. Also, for example, if the -md option is set to 5 and the -mc option is set to 5, then given that the queue has more than 5 unique hosts and more than 5 files queued per unique host, jGetFile will be downloading 25 files at a time.
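
For instance, the 25-files-at-a-time scenario above would be set up as follows (illustrative command):

java -jar jgetfile.jar -r http://www.getfiles.com -ext .zip -mc 5 -md 5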

-ffn

-ffn specifies that the Flat File naming scheme be used for downloading files. This is currently the only file naming scheme implemented, with more to come in the future. It concatenates the host, path, and file values, separated by underscores. Ex. if jGetFile is downloading

http://www.getfiles.com/windowsapps/killerapp.exe
then the name that shows up after downloading will be
www.getfiles.com_windowsapps_killerapp.exe

-df

The -df option is useful if one wants to remove downloaded files smaller than a certain number of bytes. Ex.

-df 4000
will delete downloaded files smaller than 4000 bytes (about 4 kilobytes).

-ct

The -ct (Max Crawler Threads) option tells jGetFile to use n threads to internally parse links. Ex. if link a has 1000 links in its HTML content, then with a -ct value of 1, only 1 link at a time will be parsed to get the links contained in its HTML content. If a value of 5 is specified (the default), then 5 of those 1000 links will be consumed simultaneously. Note that setting this option to a high number will not always result in better performance. In the majority of cases, jGetFile parses through links faster than they can be retrieved from the web, so if -ct equals 20 and there are only 5 new links waiting to be consumed, only 5 threads will be active until more new links become available.
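
Ex. to use 10 crawler threads:

-ct 10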

-de

The -de option tells jGetFile to dynamically exclude traversing non-HTML links. When a link is encountered and is about to be traversed, if the Content-Type of the HTTP response is not text/html, then the extension of the link will be remembered, and links with that extension will not be traversed in the future. Ex. if the only specified extension is .doc, yet we run across a .pdf link, then instead of attempting to traverse .pdf links every time, with the -de option enabled no more links ending with .pdf will be allowed in the queue.

-d

-d specifies the depth to traverse down from the root address. This option is critical for the majority of mass downloading scenarios. When the -d option is specified, jGetFile internally uses a tree-based model to represent the links. If the -d option is not specified, then jGetFile uses an unbounded iterations-based model, meaning links will be traversed until jGetFile is killed. Ex. if -d is set to 2, then only files from links found on the root address will be downloaded; if -d is set to 1, only files from the root address itself will be downloaded.
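
For example, to download .pdf files while staying within two levels of the root address (illustrative command):

java -jar jgetfile.jar -r http://www.getfiles.com -ext .pdf -d 2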

-it

The -it option specifies the maximum number of links to traverse. Ex.

-it 100 
will tell jGetFile to only follow the first 100 links it encounters, then stop accepting more links, finish queued downloads, and exit. Internally, the -it option uses a Max Iterations Model, whereas specifying neither -it nor -d tells jGetFile to use an Unbounded Iterations Model, which traverses links until the user kills the program.

-als

The -als option allows a user-specified BeanShell script to be used in determining whether a link will be traversed or discarded. This allows arbitrarily complex rules to be specified for accepting or rejecting links, like accepting only links that begin with www.blah.com, excluding links that begin with www.foo.com, and excluding links that contain the word cat, while accepting links that contain the word dog. The code for that would be:

// link and origin are provided by jGetFile (see below); the script
// must set acceptLink to indicate whether to traverse the link.
if(link.startsWith("http://www.blah.com")
	&& !(link.startsWith("http://www.foo.com"))
	&& !(link.contains("cat"))
	&& link.contains("dog")
  )
  acceptLink = true;   // traverse this link
else
  acceptLink = false;  // discard this link
Currently, there are 2 variables passed to the script to be evaluated: link and origin. link specifies the current link in question; origin specifies the link that the current link came from. The script must set the variable acceptLink to yes, no, true, or false to be evaluated accordingly; anything else will automatically be evaluated as false. In the future more variables will be pushed to the script, such as the depth of the link when using the -d option.
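
To use a script, save it to a file and pass its path to -als; the file name here is illustrative:

-als acceptLink.bsh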

-re

The -re option allows a regular expression to be used in determining whether a link will be traversed or discarded. Ex. -re "http://www\.gnarly.*" will match all links beginning with http://www.gnarly. Note: the regular expression must be surrounded with double quotes on the command line.

-iU

The -iU option allows a regular expression to be specified for parsing out inner urls from hrefs. Ex. the href in question is http://www.downloadmyfiles.com?url=http://thereallink.com. By default, this case is supported because the default regular expression is ".*(url=|u=)", which would parse out http://thereallink.com. However, say there is a unique case jGetFile doesn't handle by default, such as http://www.downloadmyfiles.com?referrer=http://thereallink.com. Then you could specify a custom regular expression to handle this case, such as ".*(url=|u=|referrer=)". Note: the regular expression must be surrounded with double quotes on the command line.

-h

-h prints the help as seen at the top of this page.

©2006 Samuel Mendenhall