[jira] Created: (NUTCH-807) JSParseFilter produces weird URL

2010-04-02 Thread Minyao Zhu (JIRA)
JSParseFilter produces weird URL
-

 Key: NUTCH-807
 URL: https://issues.apache.org/jira/browse/NUTCH-807
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0.0
 Environment: Redhat 2.6.18-128.1.6.el5PAE  i686 i686 i386 GNU/Linux
Reporter: Minyao Zhu


This was found when crawling http://zhidao.baidu.com/ (a Chinese-language
site).

The page appears to contain JavaScript that confuses JSParseFilter, which then
produces URLs like this:

http://zhidao.baidu.com/){if(A===46){baidu.hide(

I am not sure of the impact or scope of this issue in general. For this
specific site, the observed effect is that far fewer pages were crawled.

Thanks.
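
One possible way to screen out candidates like the one above is sketched below. This is not how JSParseFilter works and not a proposed patch; it is a minimal Python illustration, assuming that a candidate outlink should contain only characters that RFC 3986 allows in a URI and should have an http(s) scheme and a host. The function name is_plausible_outlink is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Characters RFC 3986 permits in a URI: unreserved, reserved, and '%'.
# Braces, spaces and quotes are not in this set.
_URI_CHARS = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+$")


def is_plausible_outlink(candidate: str) -> bool:
    """Return True if the string looks like a usable http(s) URL.

    Hypothetical helper for illustration only; the rules here are an
    assumption, not Nutch behaviour.
    """
    if not _URI_CHARS.match(candidate):
        return False
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit raises ValueError for malformed bracketed hosts.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

With this check, the URL reported above is rejected because it contains '{', while http://zhidao.baidu.com/ itself passes. Parentheses alone would not be rejected, since RFC 3986 allows them.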

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)
Evaluate ORM Frameworks which support non-relational column-oriented datastores 
and RDBMs 
--

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar


We have an ORM layer in the NutchBase branch, which uses the Avro Specific 
Compiler to compile class definitions given in JSON. Before moving on with 
this, we might benefit from evaluating other frameworks to see whether they 
suit our needs. 

We want at least the following capabilities:
- Using POJOs 
- Able to persist objects to at least HBase, Cassandra, and RDBMs 
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries 




Any comments or suggestions for other frameworks are welcome.




Re: [VOTE] Apache Tika 0.7 Release Candidate #1

2010-04-02 Thread Mattmann, Chris A (388J)
(apologies for the cross-post, but this impacts Nutch 1.1, so just wanted
folks to see it)

* +1 on extending the deadline until Monday, April 5th. Right now, we have 3
+1s, so technically we could still do the 72 hrs and still be OK, but I'm
fine with giving folks some more time to take a look
* Thanks to jzitting and gsingers for taking a look and voting so far
* Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch
1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That
OK, Nutchers?
* Thanks for comments on the CHANGES from gsingers, and the mention to
include the sha1 of the src archive from jzitting. Will do on both, going
forward. 
* +1 for having a direct link to tika-app on the website.
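
For anyone who wants to check a downloaded archive against a published SHA1, a minimal Python sketch follows. It only assumes a local file path; the function name sha1_of_file and the chunk size are illustrative choices, not part of any release tooling.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be compared directly with the contents of the published .sha1 file.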

Cheers,
Chris




On 4/1/10 11:41 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:

> Hi,
>
> On Wed, Mar 31, 2010 at 10:01 PM, Mattmann, Chris A (388J)
> chris.a.mattm...@jpl.nasa.gov wrote:
>> Please vote on releasing these packages as Apache Tika 0.7.
>
> +1 Thanks!
>
> Some minor notes:
> * It would be good to have also a SHA1 checksum for the release archive.
> * Perhaps we should start offering also the tika-app jar as a direct
> download from l.a.o/tika/download.html?
>
>> The vote is open for the next 72 hours.
>
> It looks like people.apache.org is not accessible at the moment (I
> downloaded the release candidate yesterday), so it might be a good
> idea to extend the vote period over the Easter holidays.
>
> BR,
>
> Jukka Zitting


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




[jira] Created: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
Parse-metatags plugin
-

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser, specify the following in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The names 
are separated by ';'.

To extract the values of the metatags description and keywords, you must 
specify the following in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com
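
The selection rule described above (a ';'-separated list of names, with '*' meaning every metatag) can be illustrated with a short Python sketch. This is not the plugin's code, which is Java and lives in the attached patch; the class MetaTagCollector and the function extract_metatags are hypothetical names, and matching names case-insensitively is an assumption.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> values for the selected names."""

    def __init__(self, names_setting="*"):
        super().__init__()
        # The setting is a ';'-separated list; '*' selects every metatag.
        wanted = [n.strip().lower() for n in names_setting.split(";") if n.strip()]
        self.match_all = "*" in wanted
        self.wanted = set(wanted)
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if not name or content is None:
            return
        if self.match_all or name in self.wanted:
            # Keep a list per name so repeated tags yield multiple values.
            self.found.setdefault(name, []).append(content)


def extract_metatags(html, names_setting="*"):
    collector = MetaTagCollector(names_setting)
    collector.feed(html)
    collector.close()
    return collector.found
```

Calling extract_metatags(page, "description;keywords") returns only those two names, while the default '*' returns every named metatag on the page. Values are kept as found; whether a keywords value should be split further is left open here.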






[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch





Re: [VOTE] Apache Tika 0.7 Release Candidate #1

2010-04-02 Thread Julien Nioche
Hi Chris,


> * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch
> 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That
> OK, Nutchers?


Great. I'll definitely give 0.7 a try and make sure it works in Nutch.

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com









[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: (was: NUTCH-809.patch)

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche




[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch





[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Description: 
h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The names 
are separated by ';'.

To extract the values of the metatags description and keywords, you must 
specify the following in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com



  was:
h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser, specify the following in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The names 
are separated by ';'.

To extract the values of the metatags description and keywords, you must 
specify the following in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com




 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852840#action_12852840
 ] 

Enis Soztutar commented on NUTCH-808:
-

A candidate framework is DataNucleus. It has the following benefits:

- Apache 2 license
- JDO support
- HBase, RDBMS, and XML persistence

I will further investigate whether we can integrate Hadoop Writables/Avro 
serialization so that objects can be passed from MapReduce jobs. 


 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar




Re: [VOTE] Apache Tika 0.7 Release Candidate #1

2010-04-02 Thread Andrzej Bialecki
On 2010-04-02 16:14, Mattmann, Chris A (388J) wrote:

> * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch
> 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That
> OK, Nutchers?

Yes - thanks!


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Tika 0.7 Release Candidate #1

2010-04-02 Thread Mattmann, Chris A (388J)
Hey Jukka,

Sounds good to me then if no one else objects.

I'll wait the 72 hrs (Sat, 4:01 PM EST) and then assuming the VOTE passes, roll 
the releases out to the mirrors and then work on Nutch 1.1.

Cheers,
Chris



On 4/2/10 11:41 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:

> Hi,
>
> On Fri, Apr 2, 2010 at 4:14 PM, Mattmann, Chris A (388J)
> chris.a.mattm...@jpl.nasa.gov wrote:
>> +1s, so technically we could still do the 72 hrs and still be OK, but I'm
>> fine with giving folks some more time to take a look
>
> I'm fine with closing the vote already at 72 hours since the p.a.o
> outage only seemed to last a few hours.
>
> Jukka


