Re: How to extract pdf content from a html page

Aalok Agrawal Sat, 26 Aug 2017 03:31:03 -0700

I described one approach in my first email and another in my last email.
What I meant to say is that neither of them works. The URL is not public,
so I can't share it. As you suggested, I opened download.pdf in a text
editor and found that it is not a PDF but the login page of my site.
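
The "%PDF" test that Tilman suggested can also be done in code before the
data is handed to PDFBox. A minimal sketch using only the JDK; the names
PdfHeaderCheck and looksLikePdf are made up for illustration and are not part
of PDFBox:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a PDFBox class.
class PdfHeaderCheck {
    // A PDF file begins with the ASCII marker "%PDF".
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with "%PDF". */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first bytes of a file and checks them for the marker. */
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int n = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r == -1) {
                    break;
                }
                n += r;
            }
            return n == head.length && looksLikePdf(head);
        }
    }
}
```

Calling PdfHeaderCheck.looksLikePdf(Paths.get("download.pdf")) on a saved
login page would return false, because that file starts with HTML and not
with the marker.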


So I have to write code that passes credentials so that the request can get
through authentication. Is there a way to pass credentials using the PDFBox API?
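
As far as I can tell, PDFBox only parses PDF data and has no HTTP client of
its own, so the credentials would have to be sent by the code that opens the
URL. A rough sketch, assuming the site accepts HTTP Basic authentication. A
form-based login page like the one that ended up in download.pdf would
probably need a POST of the form fields plus the session cookie instead, which
is not shown here. AuthenticatedFetch and its methods are hypothetical names,
not part of PDFBox:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not a PDFBox class. Assumes HTTP Basic authentication.
class AuthenticatedFetch {

    /** Builds the value of an HTTP Basic "Authorization" header. */
    static String basicAuthHeader(String user, String password) {
        String pair = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    /** Downloads the URL content, sending the credentials with the request. */
    static byte[] fetch(String strURL, String user, String password) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(strURL).openConnection();
        conn.setRequestProperty("Authorization", basicAuthHeader(user, password));
        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            // Copy the whole response body into memory.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            conn.disconnect();
        }
    }
}
```

If the returned bytes start with "%PDF", they could then be passed to PDFBox,
for example with PDDocument.load(new ByteArrayInputStream(bytes)) in PDFBox
1.8 or 2.0. If they still start with HTML, the login did not succeed or the
PDF lives at a different URL.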

On Fri, Aug 25, 2017 at 9:41 PM, Tilman Hausherr <[email protected]>
wrote:

> On 25.08.2017 at 11:45, Aalok Agrawal wrote:
>
>> You got it right, the PDF is within a www page. Its URL is known and is
>> passed as a variable (strURL) to the function. I tried another approach to
>> get the content of the PDF rendered there, but that is also not working -
>>
>
>
> Is the URL public and freely available? If yes, please mention it so I can
> test.
>
> "but that is also not working" - what does that mean? Do you get an error
> message, nothing, a JVM crash, a BSOD, ...?
>
> What is in that "download.pdf" file? Is this a PDF or is it not? Does it
> start with "%PDF" or not if you open the file with NOTEPAD++?
>
> If it isn't, then it means that your PDF has a different URL. You'll have
> to look at the html / javascript source code to find out what is going on.
>
> Tilman
>
>
>
>
>
>> byte[] ba1 = new byte[1024];
>> int baLength;
>> URL url = new URL(strURL);
>> URLConnection urlConn = url.openConnection();
>>
>> // Read from urlConn; url.openStream() would create a separate connection
>> // and leave urlConn unused. try-with-resources closes both streams even
>> // if the copy fails.
>> try (InputStream is1 = urlConn.getInputStream();
>>      FileOutputStream fos1 = new FileOutputStream("download.pdf")) {
>>     while ((baLength = is1.read(ba1)) != -1) {
>>         fos1.write(ba1, 0, baLength);
>>     }
>> }
>> // load(File) is available in both PDFBox 1.8 and 2.0.
>> pdDoc = PDDocument.load(new File("download.pdf"));
>> parsedText = pdfStripper.getText(pdDoc);
>>
>> On Fri, Aug 25, 2017 at 12:45 AM, Tilman Hausherr <[email protected]>
>> wrote:
>>
>>> On 24.08.2017 at 19:27, Aalok Agrawal wrote:
>>>
>>>> I have written the following code -
>>>>
>>>> PDFTextStripper pdfStripper = null;
>>>> PDDocument pdDoc = null;
>>>> COSDocument cosDoc = null;
>>>> String parsedText = null;
>>>>
>>>> URL url = new URL(strURL);
>>>> BufferedInputStream file = new BufferedInputStream(url.openStream());
>>>> PDFParser parser = new PDFParser(file);
>>>>
>>>> parser.parse();
>>>> cosDoc = parser.getDocument();
>>>> pdfStripper = new PDFTextStripper();
>>>>
>>>> pdDoc = new PDDocument(cosDoc);
>>>> parsedText = pdfStripper.getText(pdDoc);
>>>>
>>>> But it is not fetching content of pdf embedded in browser.
>>>>
>>> PDFBox can't communicate with your browser.
>>>
>>> url.openStream()
>>>
>>> means that the URL content is fetched.
>>>
>>> Could it be that the PDF is within a www page? I.e. HTML outside, and PDF
>>> in a smaller window / frame? Then you'd need to know that URL.
>>>
>>> Tilman
>>>
>>>
>>>
>>>> On Thu, Aug 24, 2017 at 9:08 PM, Gilad Denneboom <[email protected]>
>>>> wrote:
>>>>
>>>>> If you don't know the file's URL or the path of the local temp file to
>>>>> which it is saved I don't see how you could do it.
>>>>>
>>>>> On Thu, Aug 24, 2017 at 4:08 PM, Aalok Agrawal <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am working on an application where pdf is getting rendered in
>>>>>> browser. There is no pdf extension in URL.
>>>>>>
>>>>>> I have to read the content of the pdf & check some text. Is there any
>>>>>> way to do that.
>>>>>>
>>>>>> Thanks
>>>>>> Aalok Agrawal
>>>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>
>
