[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Bug 6421 depends on bug 6422, which changed state.

Bug 6422 Summary: Extract embedded text from PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6422

What       |Removed |Added
Status     |NEW     |RESOLVED
Resolution |---     |FIXED

--
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #12 from DrTrigon dr.tri...@surfeu.ch ---
(In reply to comment #11)
> Not a bad idea. We have a bug filed for it somewhere?

What about bug 21061, maybe bug 13370 as well?! Or shall I create a new one?
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #13 from Chad H. innocentkil...@gmail.com ---
(In reply to comment #12)
> (In reply to comment #11)
> > Not a bad idea. We have a bug filed for it somewhere?
>
> What about bug 21061, maybe bug 13370 as well?! Or shall I create a new one?

Those will do great :)
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

DrTrigon dr.tri...@surfeu.ch changed:

What     |Removed |Added
See Also |        |https://bugzilla.wikimedia.org/show_bug.cgi?id=21061
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

DrTrigon dr.tri...@surfeu.ch changed:

What     |Removed |Added
See Also |        |https://bugzilla.wikimedia.org/show_bug.cgi?id=13370
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #14 from DrTrigon dr.tri...@surfeu.ch ---
Good! I linked them. Am I wrong or should we now be able to close bug 6422 as well?
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

DrTrigon dr.tri...@surfeu.ch changed:

What     |Removed                                              |Added
See Also |https://bugzilla.wikimedia.org/show_bug.cgi?id=21061 |
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #10 from DrTrigon dr.tri...@surfeu.ch ---
Nice! Good job - thanks!

What about including metadata into the search (indexing) as well??
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #11 from Chad H. innocentkil...@gmail.com ---
(In reply to comment #10)
> Nice! Good job - thanks!
>
> What about including metadata into the search (indexing) as well??

Not a bad idea. We have a bug filed for it somewhere?
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #8 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 101252 merged by jenkins-bot:
Index and search file text from pdf/djvu files

https://gerrit.wikimedia.org/r/101252
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Nik Everett neverett+bugzi...@wikimedia.org changed:

What       |Removed         |Added
Status     |PATCH_TO_REVIEW |RESOLVED
Resolution |---             |FIXED

--- Comment #9 from Nik Everett neverett+bugzi...@wikimedia.org ---
Merged. It won't take effect until after a full reindex of everything in the file namespace. That'll take a few days after the deployment. Results for a document will start showing up once that document has been indexed.

Also, file text results are weighted at 0.8 of a page text result from a scoring standpoint.

Finally, this'll work with any files from which MediaWiki is able to extract text. If a new file type is plugged in at a later date, those files will have to be reindexed for the text to be searchable.
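To make the scoring remark in comment #9 concrete, here is a small Python sketch. It is only one reading of how such a weight could be applied, not CirrusSearch's actual implementation: the names FILE_TEXT_WEIGHT, Hit and rank, and the example titles, are hypothetical, and only the 0.8 factor comes from the comment.

```python
from dataclasses import dataclass

# The 0.8 factor is taken from comment #9; everything else here is illustrative.
FILE_TEXT_WEIGHT = 0.8


@dataclass
class Hit:
    title: str
    raw_score: float      # score the match would get as ordinary page text
    in_file_text: bool    # True if the match was found in extracted file text

    @property
    def score(self) -> float:
        # Assumption: a file text match is scaled down by a constant factor.
        if self.in_file_text:
            return self.raw_score * FILE_TEXT_WEIGHT
        return self.raw_score


def rank(hits):
    """Return hits ordered from highest to lowest weighted score."""
    return sorted(hits, key=lambda hit: hit.score, reverse=True)
```

Under this reading, two equally good matches rank with the page text match first and the file text match second.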
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #6 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 101252 had a related patch set uploaded by Brian Wolff:
Begin indexing file text from pdf/djvu files

https://gerrit.wikimedia.org/r/101252
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Gerrit Notification Bot gerritad...@wikimedia.org changed:

What   |Removed |Added
Status |NEW     |PATCH_TO_REVIEW
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #7 from Nemo federicol...@tiscali.it ---
(In reply to comment #6)
> Change 101252 had a related patch set uploaded by Brian Wolff:
> Begin indexing file text from pdf/djvu files
>
> https://gerrit.wikimedia.org/r/101252

Ah, wonderful. That's in CirrusSearch and the core part was already done in https://gerrit.wikimedia.org/r/#/c/99715/ , but there's nothing more specific than the DjVu component so I'm not moving this bug.
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

reza1615 reza.ene...@gmail.com changed:

What   |Removed |Added
CC     |        |reza.ene...@gmail.com
Blocks |31552   |41037

--
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #5 from DrTrigon dr.tri...@surfeu.ch 2012-09-10 00:08:25 UTC ---
Is it possible to include metadata in the search (indexing) in the MediaWiki software?

As I was informed, the text layer gets extracted by PdfHandler [1] and is stored in the image's (PDF's) metadata [2] (name=0).

[1] https://www.mediawiki.org/wiki/Extension:PdfHandler
[2] http://commons.wikimedia.org/w/api.php?action=query&iilimit=500&iiprop=metadata|timestamp&prop=imageinfo&titles=File:Resume-.pdf
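For illustration, the query in [2] can be assembled programmatically. The following is a minimal Python sketch and not part of the original comment: the function names build_metadata_query and query_params are hypothetical, the use of https and the format=json parameter are added assumptions, and the file title is copied exactly as it appears in [2]. It only builds the URL and makes no network request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the URL in [2]; https is an assumption of this sketch.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_metadata_query(title, limit=500):
    """Build an imageinfo query URL asking for a file's metadata and timestamp."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata|timestamp",
        "iilimit": limit,
        "titles": title,
        "format": "json",  # assumption: not present in the original URL
    }
    # urlencode percent-encodes "|" and ":" so the URL is safe to paste anywhere.
    return API_ENDPOINT + "?" + urlencode(params)


def query_params(url):
    """Decode a URL's query string back into a dict (first value per key)."""
    parsed = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in parsed.items()}


if __name__ == "__main__":
    print(build_metadata_query("File:Resume-.pdf"))
```

Building the query from a dict rather than by string concatenation keeps the separators between parameters from being lost or mangled.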
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

DrTrigon dr.tri...@surfeu.ch changed:

What       |Removed |Added
CC         |        |dr.tri...@surfeu.ch
Depends on |        |6422
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #3 from DrTrigon dr.tri...@surfeu.ch 2012-09-04 15:32:08 UTC ---
As far as I can see, this bug is quite old and additionally marked low in priority. Is this bug going to be fixed at all?

In my opinion, solving this bug is a *must have*.

DrTrigonBot [1] does file-content-based categorization on Commons. For this, embedded text from PDF (later DjVu too) is extracted and processed. We are currently debating [2] whether or not to store this text data on a page, in order to enable the MediaWiki search engine to index and find those contents.

Now the question is: When is this bug scheduled to be fixed? Will it be fixed at all?

IF NOT: As mentioned, DrTrigonBot could dump the files' text content to a dedicated page in order to enable the MediaWiki search engine to handle it. This should be considered a workaround only and would not be needed at all if and when this bug is solved.

[1] http://commons.wikimedia.org/wiki/User:DrTrigonBot
[2] http://commons.wikimedia.org/wiki/User_talk:DrTrigonBot/JavaScript#PDF_content_extraction
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #4 from Nemo_bis federicol...@tiscali.it 2012-09-04 15:53:54 UTC ---
I doubt this will be solved any time soon. There's nobody working on this or related issues, and search is a monster nobody really wants to touch AFAIK, so I'd suggest you implement whatever workaround you think is worth about 5 more years of usage.

The pages you create should probably be as hidden as possible from users. In particular, they shouldn't be indexed by external search engines, or they would e.g. compete with Wikisource (or even archive.org), which doesn't make any sense.
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Doug wikipediad...@googlemail.com changed:

What   |Removed |Added
Blocks |        |31552
[Bug 6421] Extract embedded text from DjVu and PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Doug wikipediad...@googlemail.com changed:

What    |Removed |Added
CC      |        |wikipediad...@googlemail.com
Blocks  |        |35925
Summary |Extract embedded text from DjVu documents for search |Extract embedded text from DjVu and PDF documents for search

--- Comment #2 from Doug wikipediad...@googlemail.com 2012-04-12 22:16:27 UTC ---
I'm pretty sure he means that search results should include hits from within the text layer of the file scans proper. So if the scans include the text "Foobar", a search for "bar" would return the text from the scan's text layer as a result, even though the text layer had not been extracted and placed on a wiki page.

There's no reason that this should be restricted to DjVu files (though there was at the time the bug was filed).