[jira] [Updated] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-07-06 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-9557:
-
Attachment: (was: udf_cosine_similarity-v01.patch)

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: HIVE-9557.1.patch, HIVE-9557.2.patch, HIVE-9557.3.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java
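
 The arithmetic in the example above can be reproduced with a small sketch. It assumes a set-based token cosine, |A ∩ B| / sqrt(|A| * |B|), with tokens split on whitespace. The class name CosineSketch, the method names, and the choice to return 0 for empty input are hypothetical and are not taken from any attached patch.

```java
import java.util.HashSet;
import java.util.Set;

final class CosineSketch {
    /** Set-based cosine similarity over whitespace-separated tokens. */
    static double similarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() || tb.isEmpty()) {
            return 0.0; // assumption: an empty input shares nothing with anything
        }
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb); // tokens present in both strings
        return common.size() / Math.sqrt((double) ta.size() * tb.size());
    }

    /** Splits on runs of whitespace and drops empty tokens. */
    private static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>();
        for (String t : s.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }
}
```

 For 'Test String1' and 'Test String2' there is one shared token out of two on each side, so the sketch gives 1 / sqrt(2 * 2) = 0.5, which agrees with the value quoted in the description.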



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11137) In DateWritable remove the use of LazyBinaryUtils

2015-07-06 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-11137:
--
Assignee: Owen O'Malley  (was: Nishant Kelkar)

 In DateWritable remove the use of LazyBinaryUtils
 -

 Key: HIVE-11137
 URL: https://issues.apache.org/jira/browse/HIVE-11137
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: HIVE-11137.1.patch


 Currently the DateWritable class uses LazyBinaryUtils, which has a lot of 
 dependencies.





[jira] [Updated] (HIVE-11137) In DateWritable remove the use of LazyBinaryUtils

2015-07-06 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-11137:
--
Attachment: (was: HIVE-11137.1.patch)

 In DateWritable remove the use of LazyBinaryUtils
 -

 Key: HIVE-11137
 URL: https://issues.apache.org/jira/browse/HIVE-11137
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Owen O'Malley

 Currently the DateWritable class uses LazyBinaryUtils, which has a lot of 
 dependencies.





[jira] [Updated] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-07-06 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-9557:
-
Attachment: (was: HIVE-9557.1.patch)

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Alexander Pivovarov
  Labels: CosineSimilarity, SimilarityMetric, UDF

 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Updated] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-07-06 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-9557:
-
Attachment: (was: HIVE-9557.3.patch)

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Alexander Pivovarov
  Labels: CosineSimilarity, SimilarityMetric, UDF

 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Updated] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-07-06 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-9557:
-
Attachment: (was: HIVE-9557.2.patch)

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Alexander Pivovarov
  Labels: CosineSimilarity, SimilarityMetric, UDF

 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-11137) In DateWritable remove the use of LazyBinaryUtils

2015-07-01 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610813#comment-14610813
 ] 

Nishant Kelkar commented on HIVE-11137:
---

Is this an unrelated test failure?

 In DateWritable remove the use of LazyBinaryUtils
 -

 Key: HIVE-11137
 URL: https://issues.apache.org/jira/browse/HIVE-11137
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Nishant Kelkar
 Attachments: HIVE-11137.1.patch


 Currently the DateWritable class uses LazyBinaryUtils, which has a lot of 
 dependencies.





[jira] [Commented] (HIVE-11137) In DateWritable remove the use of LazyBinaryUtils

2015-07-01 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610816#comment-14610816
 ] 

Nishant Kelkar commented on HIVE-11137:
---

From the hive.log, I see the following two issues:

{code}
2015-07-01 11:13:28,877 ERROR [Thread-17]: thrift.ThriftCLIService 
(ThriftBinaryCLIService.java:run(101)) - Error starting HiveServer2: could not 
start ThriftBinaryCLIService
org.apache.thrift.transport.TTransportException: Could not create ServerSocket 
on address 0.0.0.0/0.0.0.0:1.
at 
org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:109)
at 
org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:91)
at 
org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:87)
at 
org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:241)
at 
org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:66)
at java.lang.Thread.run(Thread.java:744)
{code}

and 

{code}
2015-07-01 11:13:18,009 DEBUG [main]: util.Shell 
(Shell.java:checkHadoopHome(320)) - Failed to detect a valid hadoop home 
directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:327)
at 
org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2375)
at 
org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:366)
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at 
org.apache.hive.service.auth.TestCustomAuthentication.setUp(TestCustomAuthentication.java:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
{code}

 In DateWritable remove the use of LazyBinaryUtils
 -

 Key: HIVE-11137
 URL: https://issues.apache.org/jira/browse/HIVE-11137
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Nishant Kelkar
 Attachments: HIVE-11137.1.patch


 Currently the DateWritable class uses LazyBinaryUtils, which has a lot of 
 dependencies.





[jira] [Commented] (HIVE-11137) In DateWritable remove the use of LazyBinaryUtils

2015-07-01 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609881#comment-14609881
 ] 

Nishant Kelkar commented on HIVE-11137:
---

BTW, let me know if submitting patch != taking ownership of task in general. 
That way, I can hand it back to you (still learning all the rules here). Thank 
you!

 In DateWritable remove the use of LazyBinaryUtils
 -

 Key: HIVE-11137
 URL: https://issues.apache.org/jira/browse/HIVE-11137
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Nishant Kelkar
 Attachments: HIVE-11137.1.patch


 Currently the DateWritable class uses LazyBinaryUtils, which has a lot of 
 dependencies.





[jira] [Commented] (HIVE-11137) In DateWritable remove the use of LazyBinaryUtils

2015-06-29 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605940#comment-14605940
 ] 

Nishant Kelkar commented on HIVE-11137:
---

LazyBinaryUtils is used only for readVInt() and writeVInt(). Relevant sections 
of code from LazyBinaryUtils:

{code}
  private static ThreadLocal<byte[]> vLongBytesThreadLocal =
      new ThreadLocal<byte[]>() {
    @Override
    public byte[] initialValue() {
      return new byte[9];
    }
  };

  public static void writeVLong(RandomAccessOutput byteStream, long l) {
    byte[] vLongBytes = vLongBytesThreadLocal.get();
    int len = LazyBinaryUtils.writeVLongToByteArray(vLongBytes, l);
    byteStream.write(vLongBytes, 0, len);
  }
{code}

{code}
  /**
   * Reads a zero-compressed encoded int from a byte array and returns it.
   *
   * @param bytes
   *  the byte array
   * @param offset
   *  offset of the array to read from
   * @param vInt
   *  storing the deserialized int and its size in byte
   */
  public static void readVInt(byte[] bytes, int offset, VInt vInt) {
    byte firstByte = bytes[offset];
    vInt.length = (byte) WritableUtils.decodeVIntSize(firstByte);
    if (vInt.length == 1) {
      vInt.value = firstByte;
      return;
    }
    int i = 0;
    for (int idx = 0; idx < vInt.length - 1; idx++) {
      byte b = bytes[offset + 1 + idx];
      i = i << 8;
      i = i | (b & 0xFF);
    }
    vInt.value = (WritableUtils.isNegativeVInt(firstByte) ? (i ^ -1) : i);
  }
{code}

I could contribute a patch towards this task [~owen.omalley] (I'm a beginner 
contributor in Hive, looking around for work :)). Thanks and let me know!
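
To illustrate how small the needed logic is, here is a self-contained sketch of the zero-compressed encoding that the two methods above rely on. It follows the byte layout of Hadoop's WritableUtils as I understand it; the class name VIntSketch and its method names are hypothetical, it works on long rather than int, and it is not the code from any attached patch.

```java
final class VIntSketch {
    /** Encodes v into buf starting at index 0; returns the bytes used (1 to 9). */
    static int write(byte[] buf, long v) {
        if (v >= -112 && v <= 127) {
            buf[0] = (byte) v; // small values fit in the first byte itself
            return 1;
        }
        int len = -112;
        if (v < 0) {
            v ^= -1L; // one's complement, so only the magnitude is stored
            len = -120;
        }
        for (long tmp = v; tmp != 0; tmp >>= 8) {
            len--; // one step per payload byte
        }
        buf[0] = (byte) len; // first byte carries sign and payload length
        int n = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = n; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            buf[1 + n - idx] = (byte) ((v >> shift) & 0xFF); // big-endian payload
        }
        return n + 1;
    }

    /** Total encoded size in bytes, derived from the first byte. */
    static int size(byte first) {
        if (first >= -112) {
            return 1;
        }
        return (first < -120) ? -119 - first : -111 - first;
    }

    /** True if the first byte marks a negative value. */
    static boolean isNegative(byte first) {
        return first < -120 || (first >= -112 && first < 0);
    }

    /** Decodes the value that starts at buf[offset]. */
    static long read(byte[] buf, int offset) {
        byte first = buf[offset];
        int size = size(first);
        if (size == 1) {
            return first;
        }
        long v = 0;
        for (int idx = 0; idx < size - 1; idx++) {
            v = (v << 8) | (buf[offset + 1 + idx] & 0xFF);
        }
        return isNegative(first) ? ~v : v;
    }
}
```

The 9-byte buffer in the snippet above matches the worst case here: one length byte plus eight payload bytes.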


 In DateWritable remove the use of LazyBinaryUtils
 -

 Key: HIVE-11137
 URL: https://issues.apache.org/jira/browse/HIVE-11137
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Owen O'Malley

 Currently the DateWritable class uses LazyBinaryUtils, which has a lot of 
 dependencies.





[jira] [Updated] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-29 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-9557:
-
Attachment: HIVE-9557.3.patch

Attaching revision #3 of the patch, which removes the hidden dependency on 
FastMath from commons-math3 (it comes in via the 
org.apache.spark:spark-core_2.10 dependency). The patch uses the library Math 
class instead.


 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: HIVE-9557.1.patch, HIVE-9557.2.patch, HIVE-9557.3.patch, 
 udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Updated] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-28 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-9557:
-
Attachment: HIVE-9557.2.patch

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: HIVE-9557.1.patch, HIVE-9557.2.patch, 
 udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Updated] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-27 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-9557:
-
Attachment: HIVE-9557.1.patch

Attached the first revision of the cosine similarity UDF.

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: HIVE-9557.1.patch, udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-27 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604068#comment-14604068
 ] 

Nishant Kelkar commented on HIVE-9557:
--

Figured out the issue: I pointed a dummy HADOOP_HOME variable at HIVE_HOME. 
I also removed the commented-out queries from the udf_cosine_similarity.q 
clientpositive file. I'll upload a patch with an RB link soon.
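
The "HADOOP_HOME or hadoop.home.dir are not set." message in the earlier log suggests why a dummy value is enough. The sketch below shows the lookup order I believe is involved: the hadoop.home.dir system property first, then the HADOOP_HOME environment variable. The class HadoopHomeSketch and its methods are hypothetical and only mimic that check; they are not Hadoop's code.

```java
import java.io.IOException;

final class HadoopHomeSketch {
    /** Returns the system-property value if present, else the environment value. */
    static String resolve(String sysProp, String envVar) {
        return sysProp != null ? sysProp : envVar;
    }

    /** Fails with the same message as the log when neither source is set. */
    static String check() throws IOException {
        String home = resolve(System.getProperty("hadoop.home.dir"),
                              System.getenv("HADOOP_HOME"));
        if (home == null) {
            throw new IOException("HADOOP_HOME or hadoop.home.dir are not set.");
        }
        return home;
    }
}
```

Under that assumption, either exporting HADOOP_HOME or passing -Dhadoop.home.dir to the test JVM would satisfy the check.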

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-27 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604349#comment-14604349
 ] 

Nishant Kelkar commented on HIVE-9557:
--

Done. Could you please test for access now?

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: HIVE-9557.1.patch, udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-27 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604335#comment-14604335
 ] 

Nishant Kelkar commented on HIVE-9557:
--

Hey Alexander,
Hmmm, in the review settings, I've added the group 'hive' and the user 
'apivovarov'. 

I used rbt to create and upload the ticket to the Apache server.

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: HIVE-9557.1.patch, udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-26 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603987#comment-14603987
 ] 

Nishant Kelkar commented on HIVE-9557:
--

Hi [~apivovarov],
I followed your instructions, and everything went fine till the step where I 
run the TestCliDriver with 'mvn test'. I get the following exception in 
./itests/qtest/tmp/log/hive.log:

{code}
2015-06-26 22:25:47,656 DEBUG [main]: util.Shell 
(Shell.java:checkHadoopHome(320)) - Failed to detect a valid hadoop home 
directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:327)
at 
org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2371)
at 
org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:366)
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at org.apache.hadoop.hive.ql.QTestUtil.<init>(QTestUtil.java:354)
at 
org.apache.hadoop.hive.cli.TestCliDriver.<clinit>(TestCliDriver.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.junit.internal.runners.SuiteMethod.testFromSuiteMethod(SuiteMethod.java:35)
at org.junit.internal.runners.SuiteMethod.<init>(SuiteMethod.java:24)
at 
org.junit.internal.builders.SuiteMethodBuilder.runnerForClass(SuiteMethodBuilder.java:11)
at 
org.junit.runners.model.RunnerBuilder.safeRunnerForClass(RunnerBuilder.java:59)
at 
org.junit.internal.builders.AllDefaultPossibilitiesBuilder.runnerForClass(AllDefaultPossibilitiesBuilder.java:26)
at 
org.junit.runners.model.RunnerBuilder.safeRunnerForClass(RunnerBuilder.java:59)
at 
org.junit.internal.requests.ClassRequest.getRunner(ClassRequest.java:26)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:262)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
2015-06-26 22:25:47,669 DEBUG [main]: util.Shell 
(Shell.java:isSetsidSupported(392)) - setsid is not available on this machine. 
So not using it.
2015-06-26 22:25:47,669 DEBUG [main]: util.Shell 
(Shell.java:isSetsidSupported(396)) - setsid exited with exit code 0
2015-06-26 22:25:48,408 WARN  [main]: conf.HiveConf 
(HiveConf.java:initialize(2802)) - HiveConf of name 
hive.dummyparam.test.server.specific.config.metastoresite does not exist
2015-06-26 22:25:48,409 WARN  [main]: conf.HiveConf 
(HiveConf.java:initialize(2802)) - HiveConf of name 
hive.ql.log.PerfLogger.level does not exist
2015-06-26 22:25:48,409 WARN  [main]: conf.HiveConf 
(HiveConf.java:initialize(2802)) - HiveConf of name 
hive.dummyparam.test.server.specific.config.hivesite does not exist
2015-06-26 22:25:48,409 WARN  [main]: conf.HiveConf 
(HiveConf.java:initialize(2802)) - HiveConf of name 
hive.dummyparam.test.server.specific.config.override does not exist
2015-06-26 22:25:48,410 WARN  [main]: conf.HiveConf 
(HiveConf.java:initialize(2802)) - HiveConf of name hive.metastore.metadb.dir 
does not exist
2015-06-26 22:25:48,477 INFO  [main]: server.ZooKeeperServer 
(Environment.java:logEnv(100)) - Server 
environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2015-06-26 22:25:48,477 INFO  [main]: server.ZooKeeperServer 
(Environment.java:logEnv(100)) - Server 
environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2015-06-26 22:25:48,477 INFO  [main]: server.ZooKeeperServer 
(Environment.java:logEnv(100)) - Server environment:host.name=localhost
2015-06-26 22:25:48,477 INFO  [main]: server.ZooKeeperServer 
(Environment.java:logEnv(100)) - Server environment:host.name=localhost
2015-06-26 22:25:48,477 INFO  [main]: server.ZooKeeperServer 
(Environment.java:logEnv(100)) - Server environment:java.version=1.7.0_67
2015-06-26 22:25:48,477 INFO  [main]: server.ZooKeeperServer 
(Environment.java:logEnv(100)) - Server environment:java.version=1.7.0_67
2015-06-26 22:25:48,477 INFO  [main]: server.ZooKeeperServer 
(Environment.java:logEnv(100)) - Server environment:java.vendor=Oracle 
Corporation
2015-06-26 22:25:48,477 INFO  [main]: server.ZooKeeperServer 
(Environment.java:logEnv(100)) - 

[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-26 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603988#comment-14603988
 ] 

Nishant Kelkar commented on HIVE-9557:
--

The TestCliDriver tests actually fail with the following error:

{code}
---
 T E S T S
---
Running org.apache.hadoop.hive.cli.TestCliDriver
Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 71.797 sec <<< 
FAILURE! - in org.apache.hadoop.hive.cli.TestCliDriver
testCliDriver_udf_cosine_similarity(org.apache.hadoop.hive.cli.TestCliDriver)  
Time elapsed: 0.346 sec  <<< FAILURE!
junit.framework.AssertionFailedError: Unexpected exception 
junit.framework.AssertionFailedError: Client Execution failed with error code = 
10014 running 

select
cosine_similarity('kitten', 'sitting', ' '),
cosine_similarity('sitting kitten', 'kitten sitting', ' '),
cosine_similarity('sitting kitten', 'sitting kittens', ' '),
cosine_similarity('two#delimiters,here', 'two#delimiters#,here,too', '#,'),
cosine_similarity('test string', '', ' '),
cosine_similarity(cast(null as string), 'test string', ' '),
cosine_similarity('test string', cast(null as string), ','),
cosine_similarity(cast(null as string), cast(null as string), ' '),
cosine_similarity('a string', 'another string', '')
See ./ql/target/tmp/log/hive.log or ./itests/qtest/target/tmp/log/hive.log, or 
check ./ql/target/surefire-reports or ./itests/qtest/target/surefire-reports/ 
for specific test cases logs.
at junit.framework.Assert.fail(Assert.java:57)
at org.apache.hadoop.hive.ql.QTestUtil.failed(QTestUtil.java:1984)
at 
org.apache.hadoop.hive.cli.TestCliDriver.runTest(TestCliDriver.java:152)
at 
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udf_cosine_similarity(TestCliDriver.java:134)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at junit.framework.TestCase.runTest(TestCase.java:176)
at junit.framework.TestCase.runBare(TestCase.java:141)
at junit.framework.TestResult$1.protect(TestResult.java:122)
at junit.framework.TestResult.runProtected(TestResult.java:142)
at junit.framework.TestResult.run(TestResult.java:125)
at junit.framework.TestCase.run(TestCase.java:129)
at junit.framework.TestSuite.runTest(TestSuite.java:255)
at junit.framework.TestSuite.run(TestSuite.java:250)
at 
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
{code}

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-11114) Documentation of Pentaho Missing from Maven Central

2015-06-25 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601720#comment-14601720
 ] 

Nishant Kelkar commented on HIVE-11114:
---

[~leftylev] tagging you here for more help/info.

 Documentation of Pentaho Missing from Maven Central
 ---

 Key: HIVE-11114
 URL: https://issues.apache.org/jira/browse/HIVE-11114
 Project: Hive
  Issue Type: Task
Reporter: Nishant Kelkar
Assignee: Nishant Kelkar
Priority: Minor

 I recently cloned the Hive Git repository. When I went into the hive/ql 
 sub-project and issued the command 'mvn clean compile -Phadoop-1', I got the 
 following build error:
 [ERROR] Failed to execute goal on project hive-exec: Could not resolve 
 dependencies for project org.apache.hive:hive-exec:jar:2.0.0-SNAPSHOT: Could 
 not find artifact org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde 
 in US (http://repo.maven.apache.org/maven2) - [Help 1]
 This is because the pentaho-aggdesigner-algorithm dependency is not supported 
 by Maven central; however, it is supported by Conjars.
 As a quick fix, I downloaded the jar from Conjars repo, and manually 
 installed this dependency to my local Maven by following the instructions 
 here: 
 http://www.mkyong.com/maven/how-to-include-library-manully-into-maven-local-repository/
 However, I feel this dependency should be supported on Maven central (I'm not 
 sure where to create this ticket/whom with, but Hive is my use case, so any 
 pointers greatly appreciated).
 This ticket tracks the task of documenting this fact on the Hive wiki as an 
 additional Note.





[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-24 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598936#comment-14598936
 ] 

Nishant Kelkar commented on HIVE-9557:
--

[~apivovarov]: The reference implementation link you've provided seems to be 
broken. Did you mean to point here? -- 
https://github.com/Simmetrics/simmetrics/blob/master/simmetrics-core/src/main/java/org/simmetrics/metrics/CosineSimilarity.java

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Alexander Pivovarov

 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java
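
 For reference, the worked example in the description can be reproduced with a
 minimal Java sketch. It assumes whitespace tokenization and binary (set-based)
 term vectors, under which the cosine of the two token sets is
 |A ∩ B| / sqrt(|A| * |B|). The class and method names below
 (CosineSimilaritySketch, tokens, similarity) are illustrative only; they are
 not taken from the attached patch or from the Simmetrics implementation, which
 may tokenize and weight terms differently.

```java
import java.util.HashSet;
import java.util.Set;

public class CosineSimilaritySketch {

    // Hypothetical helper: split on whitespace and keep the distinct tokens.
    static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>();
        for (String t : s.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Cosine similarity over binary term vectors:
    // |A ∩ B| / sqrt(|A| * |B|); returns 0 if either side has no tokens.
    static float similarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() || tb.isEmpty()) {
            return 0.0f;
        }
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        return (float) (common.size() / Math.sqrt((double) ta.size() * tb.size()));
    }

    public static void main(String[] args) {
        // One shared token out of two on each side: 1 / sqrt(2 * 2)
        System.out.println(similarity("Test String1", "Test String2"));
    }
}
```

 For 'Test String1' and 'Test String2' this gives 1 / sqrt(2 * 2) = 0.5, which
 agrees with the value quoted in the description for this particular input.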





[jira] [Commented] (HIVE-11091) Unable to load data into hive table using Load data local inapth command from unix named pipe

2015-06-24 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599283#comment-14599283
 ] 

Nishant Kelkar commented on HIVE-11091:
---

I did a diff of Hive 0.11 vs. Hive 0.14 for the piece of code within MoveTask 
that is causing this error:

Hive-0.11:
{code}
Table table = db.getTable(tbd.getTable().getTableName());

if (work.getCheckFileFormat()) {
  // Get all files from the src directory
  FileStatus[] dirs;
  ArrayList<FileStatus> files;
  FileSystem fs;
  try {
    fs = FileSystem.get(table.getDataLocation(), conf);
    dirs = fs.globStatus(new Path(tbd.getSourceDir()));
    files = new ArrayList<FileStatus>();
    for (int i = 0; (dirs != null && i < dirs.length); i++) {
      files.addAll(Arrays.asList(fs.listStatus(dirs[i].getPath())));
      // We only check one file, so exit the loop when we have at least
      // one.
      if (files.size() > 0) {
        break;
      }
    }
  } catch (IOException e) {
    throw new HiveException(
        "addFiles: filesystem error in check phase", e);
  }
  if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVECHECKFILEFORMAT)) {
    // Check if the file format of the file matches that of the table.
    boolean flag = HiveFileFormatUtils.checkInputFormat(
        fs, conf, tbd.getTable().getInputFileFormatClass(), files);
    if (!flag) {
      throw new HiveException(
          "Wrong file format. Please check the file's format.");
    }
  }
}
{code}

Hive-0.14:
{code}
Table table = db.getTable(tbd.getTable().getTableName());

if (work.getCheckFileFormat()) {
  // Get all files from the src directory
  FileStatus[] dirs;
  ArrayList<FileStatus> files;
  FileSystem srcFs; // source filesystem
  try {
    srcFs = tbd.getSourcePath().getFileSystem(conf);
    dirs = srcFs.globStatus(tbd.getSourcePath());
    files = new ArrayList<FileStatus>();
    for (int i = 0; (dirs != null && i < dirs.length); i++) {
      files.addAll(Arrays.asList(srcFs.listStatus(dirs[i].getPath(),
          FileUtils.HIDDEN_FILES_PATH_FILTER)));
      // We only check one file, so exit the loop when we have at least
      // one.
      if (files.size() > 0) {
        break;
      }
    }
  } catch (IOException e) {
    throw new HiveException(
        "addFiles: filesystem error in check phase", e);
  }
  if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVECHECKFILEFORMAT)) {
    // Check if the file format of the file matches that of the table.
    boolean flag = HiveFileFormatUtils.checkInputFormat(
        srcFs, conf, tbd.getTable().getInputFileFormatClass(), files);
    if (!flag) {
      throw new HiveException(
          "Wrong file format. Please check the file's format.");
    }
  }
}
{code}

 Unable to load data into hive table using Load data local inapth command 
 from unix named pipe
 ---

 Key: HIVE-11091
 URL: https://issues.apache.org/jira/browse/HIVE-11091
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 0.14.0
 Environment: Unix,MacOS
Reporter: Manoranjan Sahoo
Priority: Blocker

 Unable to load data into hive table from unix named pipe in Hive 0.14.0 
 Please find below the execution details in env ( Hadoop2.6.0 + Hive 0.14.0):
 
 $ mkfifo /tmp/test.txt
 $ hive
 hive> create table test(id bigint,name string);
 OK
 Time taken: 1.018 seconds
 hive> LOAD DATA LOCAL INPATH '/tmp/test.txt' OVERWRITE INTO TABLE test;
 Loading data to table default.test
 Failed with exception addFiles: filesystem error in check phase
 FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
 But in Hadoop 1.3 and Hive 0.11.0 it works fine:
 hive> LOAD DATA LOCAL INPATH '/tmp/test.txt' OVERWRITE INTO TABLE test;
 Copying data from file:/tmp/test.txt
 Copying file: file:/tmp/test.txt





[jira] [Updated] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-24 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar updated HIVE-9557:
-
Attachment: udf_cosine_similarity-v01.patch

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-24 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599265#comment-14599265
 ] 

Nishant Kelkar commented on HIVE-9557:
--

Hey [~kinow] and [~apivovarov], I've added a patch for the cosine similarity 
metric UDF and some test cases. This is my first time submitting a patch, so I 
guess I'm allowed 1 chance at the following question? :)

What are all the next steps in this process, once a patch has been uploaded?

I could also add this correspondence in an email to d...@hive.apache.org, for 
everyone else's benefit. 

Thanks!

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-11091) Unable to load data into hive table using Load data local inapth command from unix named pipe

2015-06-24 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599301#comment-14599301
 ] 

Nishant Kelkar commented on HIVE-11091:
---

The only significant change I see in the above code snippets is:

{code}
srcFs = tbd.getSourcePath().getFileSystem(conf);
dirs = srcFs.globStatus(tbd.getSourcePath());
{code}

i.e. the way in which we get the file system handle and a list of the 
directories/files within the path provided. 

 Unable to load data into hive table using Load data local inapth command 
 from unix named pipe
 ---

 Key: HIVE-11091
 URL: https://issues.apache.org/jira/browse/HIVE-11091
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 0.14.0
 Environment: Unix,MacOS
Reporter: Manoranjan Sahoo
Priority: Blocker

 Unable to load data into hive table from unix named pipe in Hive 0.14.0 
 Please find below the execution details in env ( Hadoop2.6.0 + Hive 0.14.0):
 
 $ mkfifo /tmp/test.txt
 $ hive
 hive> create table test(id bigint,name string);
 OK
 Time taken: 1.018 seconds
 hive> LOAD DATA LOCAL INPATH '/tmp/test.txt' OVERWRITE INTO TABLE test;
 Loading data to table default.test
 Failed with exception addFiles: filesystem error in check phase
 FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
 But in Hadoop 1.3 and Hive 0.11.0 it works fine:
 hive> LOAD DATA LOCAL INPATH '/tmp/test.txt' OVERWRITE INTO TABLE test;
 Copying data from file:/tmp/test.txt
 Copying file: file:/tmp/test.txt





[jira] [Assigned] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-24 Thread Nishant Kelkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishant Kelkar reassigned HIVE-9557:


Assignee: Nishant Kelkar  (was: Alexander Pivovarov)

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar

 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-24 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600145#comment-14600145
 ] 

Nishant Kelkar commented on HIVE-9557:
--

[~apivovarov], I have a question: once I have prepared a 
clientpositive/udf_cosine_similarity.q and a 
clientnegative/udf_cosine_similarity.q, how do I run them? Also, how do I 
create the corresponding q.out files?



 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java





[jira] [Commented] (HIVE-9557) create UDF to measure strings similarity using Cosine Similarity algo

2015-06-24 Thread Nishant Kelkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599747#comment-14599747
 ] 

Nishant Kelkar commented on HIVE-9557:
--

Thanks for the pointers! I'll modify the patch per your instructions and 
reupload.

Thanks for working with me through my first patch! :)

 create UDF to measure strings similarity using Cosine Similarity algo
 -

 Key: HIVE-9557
 URL: https://issues.apache.org/jira/browse/HIVE-9557
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Reporter: Alexander Pivovarov
Assignee: Nishant Kelkar
  Labels: CosineSimilarity, SimilarityMetric, UDF
 Attachments: udf_cosine_similarity-v01.patch


 algo description http://en.wikipedia.org/wiki/Cosine_similarity
 {code}
 --one word different, total 2 words
 str_sim_cosine('Test String1', 'Test String2') = (2 - 1) / 2 = 0.5f
 {code}
 reference implementation:
 https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/CosineSimilarity.java


