RE: Extracting XMP metadata from PDF for indexing Nutch 1.15
Thank you, Sebastian, for updated patch and pointer to the discussion.

Joe

-----Original Message-----
From: Sebastian Nagel
Sent: Wednesday, January 15, 2020 5:25 AM
To: user@nutch.apache.org
Subject: Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

Hi Joseph,

sorry for the late reply. Anyway: the patch for NUTCH-2525 fixes your problem. See also my comments in
https://issues.apache.org/jira/browse/NUTCH-2525

Thanks,
Sebastian
Re: Extracting XMP metadata from PDF for indexing Nutch 1.15
Hi Joseph,

sorry for the late reply. Anyway: the patch for NUTCH-2525 fixes your problem. See also my comments in
https://issues.apache.org/jira/browse/NUTCH-2525

Thanks,
Sebastian

On 1/2/20 2:55 PM, Gilvary, Joseph wrote:
> Happy New Year, Sebastian,
>
> Thank you. That looks promising. Hope you enjoy the holiday!
>
> Joe
RE: Extracting XMP metadata from PDF for indexing Nutch 1.15
Happy New Year, Sebastian,

Thank you. That looks promising. Hope you enjoy the holiday!

Joe

-----Original Message-----
From: Sebastian Nagel
Sent: Thursday, January 2, 2020 7:42 AM
To: user@nutch.apache.org
Subject: Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

Hi Joseph,

this could be related to
https://issues.apache.org/jira/browse/NUTCH-2525
caused by not-all-lowercase meta keys.

I'm happy to check whether the attached patch fixes your problem when I'm back from holidays in a few days.

Best,
Sebastian
Re: Extracting XMP metadata from PDF for indexing Nutch 1.15
Hi Joseph,

this could be related to
https://issues.apache.org/jira/browse/NUTCH-2525
caused by not-all-lowercase meta keys.

I'm happy to check whether the attached patch fixes your problem when I'm back from holidays in a few days.

Best,
Sebastian

On 12/31/19 5:43 PM, Gilvary, Joseph wrote:
> Thanks, Markus,
>
> Those are the tools I've been using to debug because it's quicker than
> reindexing even a test collection in Solr. So parsechecker shows that these
> fields are in the parse metadata, but I can't figure out how to get them into
> the index. The pdf:docinfo:fields will index as pdf_docinfo_fields, but the
> other namespaces using ':' aren't making it through and I'm at a loss.
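The "not-all-lowercase meta keys" explanation in the reply above matches the symptom reported in this thread: every key that reaches the index is all lowercase, and the two that go missing contain capitals. The following is only a minimal Python model of that kind of mismatch, assuming the configured key gets lowercased before an exact, case-sensitive lookup. It is not the Nutch plugin code, and the function name `select_fields` is made up for illustration.

```python
# Illustrative model only: this is NOT Nutch's indexing code. It assumes the
# configured key is lowercased before a case-sensitive lookup, which is one
# way "not-all-lowercase meta keys" could end up being dropped.

# A few parse metadata entries as reported by parsechecker in this thread.
parse_metadata = {
    "pdf:docinfo:title": "Test File",
    "pdf:docinfo:created": "2011-04-27T18:33:06Z",
    "xmp:CreatorTool": "PScript5.dll Version 5.2.2",
    "xmpTPg:NPages": "23",
}

# A subset of the keys requested through index.parse.md.
configured = [
    "pdf:docinfo:title",
    "pdf:docinfo:created",
    "xmp:CreatorTool",
    "xmpTPg:NPages",
]


def select_fields(metadata, keys, lowercase_keys):
    """Return the metadata entries found under the requested keys.

    With lowercase_keys=True each requested key is lowercased before the
    exact lookup (the assumed faulty behaviour); with False it is used as-is.
    """
    found = {}
    for key in keys:
        lookup = key.lower() if lowercase_keys else key
        if lookup in metadata:
            found[lookup] = metadata[lookup]
    return found


buggy = select_fields(parse_metadata, configured, lowercase_keys=True)
fixed = select_fields(parse_metadata, configured, lowercase_keys=False)

print(sorted(buggy))  # only the all-lowercase pdf:docinfo:* keys
print(sorted(fixed))  # all four configured keys
```

In this model the mixed-case keys vanish silently, with no error, which is consistent with indexchecker simply not listing them.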
RE: Extracting XMP metadata from PDF for indexing Nutch 1.15
Thanks, Markus,

Those are the tools I've been using to debug because it's quicker than reindexing even a test collection in Solr. So parsechecker shows that these fields are in the parse metadata, but I can't figure out how to get them into the index. The pdf:docinfo:fields will index as pdf_docinfo_fields, but the other namespaces using ':' aren't making it through and I'm at a loss.

Nutch schema.xml:

    <field name="xmpTPg_NPages" type="int" indexed="true" stored="true"/>

nutch-site.xml:

    <property>
      <name>index.parse.md</name>
      <value>description,keywords,dcterms.created,dcterms.modified,dcterms.subject,pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,xmp:CreatorTool,xmpTPg:NPages</value>
    </property>

Parsechecker sees the values for the xmp stuff:

    Parse Metadata: date=2011-04-27T18:36:58Z pdf:PDFVersion=1.4
    pdf:docinfo:title=Test File xmp:CreatorTool=PScript5.dll Version 5.2.2
    access_permission:blah_blah_blah xmpTPg:NPages=23
    access_permission:can_modify=true pdf:docinfo:producer=Acrobat Distiller 7.0.5 (Windows)
    pdf:docinfo:created=2011-04-27T18:33:06Z

Indexchecker doesn't:

    fetching: http://127.0.01/test.pdf
    robots.txt whitelist not configured.
    parsing: http://127.0.01/test.pdf
    pdf:docinfo:title : Test File
    tstamp :Tue Dec 31 11:23:28 EST 2019
    pdf:docinfo:modified : 2011-04-27T18:36:58Z
    pdf:docinfo:created : 2011-04-27T18:33:06Z

The Dublin Core values don't use colon ':' but dot '.' and they show up fine. There are embedded spaces in some of the xmp values, but pdf:docinfo:title has that, too, and it shows up in the indexchecker output. I'm wondering if there's anything special about the pdf:docinfo that isn't generalized or is somehow configurable for generalization to other namespaces.

Thanks,

Joe

-----Original Message-----
From: Markus Jelsma
Sent: Tuesday, December 31, 2019 8:30 AM
To: user@nutch.apache.org
Subject: RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

Hello Joseph,

> Is there more documentation on having Nutch get what Tika sees into what Solr
> will see?

No, but I believe you would want to check out the parsechecker and indexchecker tools. These tools display what Tika sees and what will be sent to Solr.

Regards,
Markus
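The by-eye comparison of parsechecker and indexchecker output in the message above can also be done mechanically. The sketch below is a small Python helper, not part of Nutch: the name `dropped_keys` is hypothetical, and the key sets are typed in by hand from the (abridged) output quoted in this thread rather than parsed from real tool output.

```python
# Hypothetical debugging helper, not part of Nutch. All values are copied
# from the message in this thread; the parse metadata listing there is
# abridged, so PARSE_KEYS is incomplete by construction.

INDEX_PARSE_MD = (
    "description,keywords,dcterms.created,dcterms.modified,dcterms.subject,"
    "pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,"
    "xmp:CreatorTool,xmpTPg:NPages"
)

# Keys listed after "Parse Metadata:" by parsechecker.
PARSE_KEYS = {
    "date",
    "pdf:PDFVersion",
    "pdf:docinfo:title",
    "xmp:CreatorTool",
    "xmpTPg:NPages",
    "access_permission:can_modify",
    "pdf:docinfo:producer",
    "pdf:docinfo:created",
}

# Field names printed by indexchecker.
INDEX_FIELDS = {
    "pdf:docinfo:title",
    "tstamp",
    "pdf:docinfo:modified",
    "pdf:docinfo:created",
}


def dropped_keys(configured_csv, parse_keys, index_fields):
    """List configured keys that were parsed but never reached the index."""
    configured = [k.strip() for k in configured_csv.split(",") if k.strip()]
    return [k for k in configured if k in parse_keys and k not in index_fields]


dropped = dropped_keys(INDEX_PARSE_MD, PARSE_KEYS, INDEX_FIELDS)
print(dropped)

# Flag the dropped keys that contain uppercase characters.
print([k for k in dropped if k != k.lower()])
```

Checking the dropped keys for uppercase characters is a quick way to see whether a case-handling problem is a plausible cause before digging into namespaces or separators.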
RE: Extracting XMP metadata from PDF for indexing Nutch 1.15
Hello Joseph,

> Is there more documentation on having Nutch get what Tika sees into what Solr
> will see?

No, but I believe you would want to check out the parsechecker and indexchecker tools. These tools display what Tika sees and what will be sent to Solr.

Regards,
Markus

-----Original message-----
> From: Gilvary, Joseph
> Sent: Tuesday 31st December 2019 14:19
> To: user@nutch.apache.org
> Subject: Extracting XMP metadata from PDF for indexing Nutch 1.15
>
> Happy New Year,
>
> I've searched the archives and the web as best I can, tinkered with
> nutch-site.xml and schema.xml, but I can't get XMP metadata that's in the
> parse metadata into the Solr (7.6) index.
>
> I want to index stuff like:
>
> xmp:CreatorTool=PScript5.dll Version 5.2.2
> xmpTPg:NPages=23
>
> I get the pdf:docinfo:created, pdf:docinfo:modified, etc. fine, but swapping
> out ':' for '_' isn't working for the xmp stuff.
>
> Is there more documentation on having Nutch get what Tika sees into what Solr
> will see?
>
> Any help appreciated.
>
> Thanks,
>
> Joe