Shell script help needed

  • acctman
    Confirmed User
    • Oct 2003
    • 2840

    #1

    Can someone who knows shell scripting spot my problem? Everything appears to be correct, but the script returns no results.

    This is the HTML that contains the item_id value (e.g. 55963573) that I need to collect:
    Code:
    <a href="http://www.domain.com/vendors/cat.html?item_id=55963573" 
    onclick="itemPlayPlop.open(this.href); return false;">
    shell script
    Code:
    while read prodName;
    do
      wget -q -U Mozilla "http://www.domain.com/$prodName/" -O - \
      | tr '"' '\n' | grep "^?item_id=" | cut -d ' ' -f 4 >> itemIDs.txt
    done < catNames.txt
    thanks in advance
  • critical
    Confirmed User
    • Aug 2009
    • 478

    #2
    Check to make sure the domain you are querying is actually returning results to you. A smart admin blocks queries from wget to db/query servers to avoid certain DDoS attacks, while a smart coder sets the client settings in wget to match those of Mozilla or another popular web browser so the request does not look automated. Set wget to look like a browser and see if you get better results. Code looks straight.

    :-)
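
One way to tell an empty response apart from a filter that drops every line is to count lines at each stage of the pipeline. The following is only a minimal sketch and not part of the original post: `stage_counts` is a hypothetical helper name, and the sample markup stands in for a live fetch from the placeholder host used in this thread.

```shell
#!/bin/sh
# stage_counts PATTERN: read a fetched page on stdin and report how many
# lines survive each stage of the tr | grep pipeline from post #1.
# (Hypothetical helper, for illustration only.)
stage_counts() {
    page=$(cat)
    if [ -z "$page" ]; then
        echo "fetch returned nothing"
        return 1
    fi
    # grep -c '' counts every line; grep -c PATTERN counts matching lines.
    split=$(printf '%s\n' "$page" | tr '"' '\n' | grep -c '')
    kept=$(printf '%s\n' "$page" | tr '"' '\n' | grep -c -- "$1" || true)
    echo "after tr: $split lines, after grep: $kept lines"
}

# Sample markup standing in for a live page (assumption, not real site data).
sample='<a href="http://www.domain.com/vendors/cat.html?item_id=55963573">'
report=$(printf '%s\n' "$sample" | stage_counts '^?item_id=')
printf '%s\n' "$report"   # prints: after tr: 3 lines, after grep: 0 lines
```

Against a live page the same helper could be fed from the fetch itself, for example `wget -q -U Mozilla -O - "http://www.domain.com/$prodName/" | stage_counts '^?item_id='`. A non-zero count after tr with zero after grep points at the filter rather than at the server.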

    • acctman
      Confirmed User
      • Oct 2003
      • 2840

      #3
      Originally posted by critical
      Check to make sure the domain you are querying is actually returning results to you. A smart admin blocks queries from wget to db/query servers to avoid certain DDoS attacks, while a smart coder sets the client settings in wget to match those of Mozilla or another popular web browser so the request does not look automated. Set wget to look like a browser and see if you get better results. Code looks straight.

      :-)
      Weird, because I used similar code to get the product names:

      Code:
      for page in {1..50}
      do
        wget -q -U Mozilla "http://www.domain.com/catalog_search/cat?p=$page" -O - \
        | tr '"' '\n' | grep "^Product photo for " | cut -d ' ' -f 4 >> catNames.txt
        sleep 15
      done

      • V_RocKs
        Damn Right I Kiss Ass!
        • Nov 2003
        • 32449

        #4
        No idea how to help you without the data example.

        • Barry-xlovecam
          It's 42
          • Jun 2010
          • 18083

          #5
          From the manual:

          '-U agent-string'
          '--user-agent=agent-string'
          Identify as agent-string to the HTTP server.

          The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as 'Wget/version', version being the current version number of Wget.

          However, some sites have been known to impose the policy of tailoring the output according to the User-Agent-supplied information. While this is not such a bad idea in theory, it has been abused by servers denying information to clients other than (historically) Netscape or, more frequently, Microsoft Internet Explorer. This option allows you to change the User-Agent line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.

          Specifying empty user agent with '--user-agent=""' instructs Wget not to send the User-Agent header in HTTP requests.
          http://www.gnu.org/software/wget/man....html#Invoking
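
The three cases the manual describes (default agent, custom agent, no header at all) can be sketched as a small wrapper. This is an illustrative sketch, not something from the manual: `fetch_page` is a hypothetical function name, and the commented URLs reuse the placeholder host from this thread with a made-up path.

```shell
#!/bin/sh
# fetch_page URL [AGENT]: print the page body on stdout.
# (Hypothetical wrapper, for illustration only.)
# With one argument, wget sends its default "Wget/version" User-Agent.
# With two, it sends AGENT instead; an empty AGENT suppresses the header.
fetch_page() {
    if [ "$#" -ge 2 ]; then
        wget -q --user-agent="$2" -O - "$1"
    else
        wget -q -O - "$1"
    fi
}

# Hypothetical usage (placeholder host and path, not a real endpoint):
# fetch_page "http://www.domain.com/some-product/"            # default agent
# fetch_page "http://www.domain.com/some-product/" "Mozilla"  # as in post #1
# fetch_page "http://www.domain.com/some-product/" ""         # no User-Agent header
```

Comparing the output of the first two calls is a quick way to see whether the server really tailors its response to the User-Agent.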

          • acctman
            Confirmed User
            • Oct 2003
            • 2840

            #6
            Originally posted by V_RocKs
            No idea how to help you without the data example.
            This is the HTML line I'm interested in. I need to extract 55963573.
            Code:
            <a href="http://www.domain.com/vendors/cat.html?item_id=55963573" 
            onclick="itemPlayPlop.open(this.href); return false;">

            • raymor
              Confirmed User
              • Oct 2002
              • 3745

              #7
              It appears one problem is that you've anchored the grep:

              Code:
              grep "^?item_id="
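              In your example, "?item_id" isn't at the beginning of a line: after the tr step
              the line begins with "http://", so with the ^ anchor nothing matches. Also,
              remember that ? is a metacharacter in extended regular expressions (grep -E) and
              in shell globs, although plain grep treats it literally.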
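
A possible unanchored version, shown only as a sketch against the sample markup from the first post rather than the real site: `extract_item_ids` is a hypothetical helper name, and `grep -o` is assumed to be available (GNU and BSD grep both provide it, but it is not required by POSIX).

```shell
#!/bin/sh
# Sample of the markup from the first post, used here instead of a live fetch.
html='<a href="http://www.domain.com/vendors/cat.html?item_id=55963573"
onclick="itemPlayPlop.open(this.href); return false;">'

# The original filter: after tr, the URL line starts with "http://",
# so the anchored pattern matches nothing and the count is 0.
old_count=$(printf '%s\n' "$html" | tr '"' '\n' | grep -c '^?item_id=' || true)

# extract_item_ids: read HTML on stdin, print one numeric item_id per line.
# (Hypothetical helper, for illustration only.)
extract_item_ids() {
    # -o prints only the matched text; with no ^ anchor the match may sit mid-line.
    # The "?" is left out of the pattern, so its regex meaning never comes up.
    grep -o 'item_id=[0-9][0-9]*' | cut -d '=' -f 2
}

ids=$(printf '%s\n' "$html" | extract_item_ids)
printf 'old filter matched %s lines\n' "$old_count"   # prints: old filter matched 0 lines
printf '%s\n' "$ids"                                  # prints: 55963573
```

In the loop from the first post, the `tr | grep | cut` stages would be replaced by a single `| extract_item_ids >> itemIDs.txt`.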

              You'll probably not get much more help without posting your actual code with the
              real URL so somebody can see what is going on. When you obfuscate things you may
              as well ask why this doesn't work:

              Code:
              some code
                 some more code 
               also code
              if code then
              do some stuff
              fi 
              < input I'm not showing you
