Hi,
I'm trying to extract meta tags from web pages. I'm using the code below,
but only a small subset of the meta tags is returned. The tags I'm most
interested in, such as the Facebook Open Graph ones, are missing from the
output. Is there a way to get those returned as well?
public static void extractMetaFromUrl(String url) throws Exception {
    Any23 runner = new Any23();
    runner.setHTTPUserAgent("test-user-agent");
    HTTPClient httpClient = runner.getHTTPClient();
    DocumentSource source = new HTTPDocumentSource(
        httpClient,
        url
    );
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    TripleHandler handler = new JSONWriter(out);
    try {
        //runner.extract(source, handler);
        final ExtractionParameters extractionParameters =
            ExtractionParameters.newDefault();
        extractionParameters.setFlag("any23.extraction.head.meta", true);
        SingleDocumentExtraction sde = new SingleDocumentExtraction(source,
            new HTMLMetaExtractorFactory(), handler);
        SingleDocumentExtractionReport rpt = sde.run(extractionParameters);
        //System.out.println(rpt.toString());
    } finally {
        handler.close();
    }
    String n3 = out.toString("UTF-8");
    System.out.println(n3);
}